Key points for developing training

July 12th, 2015

The the ISMB workshop for Education in bioinformatics Gabriella Rustici gave an interesting talk on her “Top 10 tips for setting up a bioinformatics training course”.  I agreed with pretty much all of the points and she covered a lot of the types of issues raised elsewhere (make course objectives clear, use well supported software, follow up to assess effectiveness etc).  I hope we’re considering these types of points already in our training, but it also led me to think about what I would have added to the list.

Here then are my points to consider:

Set pre-requisites, but don’t expect people to take any notice

One of the hardest things about setting up a training course, especially a short course, is deciding on what knowledge you can require from the participants so that you can focus more specifically on the topic you want to teach.  We have tried to arrange our training with a modular structure, where one part follows on from another so that students can follow a smooth path from one module to another with minimal overlap to maximise the teaching of new concepts.  Having gone to the effort of doing this we’ve discovered (as I’m sure that plenty of other people have) that people turn up to the course without the pre-requisite knowledge you expect.  We’ve found that there are a number of potential reasons for this:

  1. People don’t understand what the pre-requisites mean.  Putting something like “knowledge of R” as a pre-requisite leaves a lot of room for interpretation. You might mean that people are comfortable starting an R session, loading in some data, filtering it, then plotting and saving a simple graph. A user might decide that because they know what R is and because they sat with someone else and ran a pre-prepared script that they have “knowledge of R”, but they will almost immediately get stuck when you try to get them to run exercises.  The answer here may partly be to set more specific requirements (you must know how to…) but that also makes your course requirements seem complex and could put people off.
  2. People mis-estimate their own abilities.  This is a variation of 1, but derives more from people having followed through exercises in a previous course, but aren’t actually able to use the software independently. There is a big difference from someone having been on a course and having had them actually use the material and understood it to the point where they can use it fluidly in order to focus on more advanced concepts.
  3. People ignore pre-requisites and expect that they can pick up what they need as they go along.  Sometimes this is because the pre-requisites don’t sound daunting, sometimes it’s because they want to get a feel for the topic and don’t care if they can’t actually do all of the exercises and sometimes its because the scheduling of the different modules would mean that if they wanted to follow the expected set of modules that they wouldn’t be able to get the training they wanted for several months.

So what can we do about this?  Some suggestions.

  1. Plan on having people not meet the pre-requisites.  Don’t slow down the progress of the course for the people who do meet then, as that’s not fair to them, but find a way for the people who don’t meet them to have a useful, if sub-optimal experience.  Strategies we’ve used have been to have very short cheat sheets for pre-requisites which explain basic concepts and actions/syntax which we can hand out to people who are struggling.  They will move more slowly but should be able to make some progress.
  2. Have pre-run or easy to run alternatives to completing the exercises.  Some people won’t care that they can’t run the actual exercises but do want to be able to see the process and know how to interpret the results.  We try to provide pre-written scripts which the users can use if they prefer (although we point out that this isn’t our recommended way of doing the course).  We also have pre-computed results files for all exercises so users can skip over parts if they want (and these are also really helpful if the exercises fail for any reason too!).
  3. Make sure that your course dates are planned so that people can quickly flow through a number of modules.  When you’re planning out a season of courses start with the basic ones and work up so it makes it easy for people to do things the right way.

Find people who use the software / techniques to do the training

In my opinion the hardest part of organising training is finding the right people to prepare and present it.  For scientific training I am strongly of the opinion that the people writing and running the courses should be doing the same work on a routine basis as part of their regular job.  If all a training course is is an effective presentation of pre-prepared material, then there is very little reason for doing in-person training – an online course or video tutorial will do just as well.  What makes a truly effective course is having a training who is intimately familiar with the material they are presenting who can step off the regular script to address points or concerns raised by people on the course.  Why do you need this?

  1. Courses can’t cover everything, and people will often come with some very specific questions which they’d like answered and which probably be explicitly covered in your material.  Often a couple of minutes with someone who works in the area will be more valuable to a participant than the rest of the course, and will radically alter people’s perception of how useful the course has been.
  2. People often don’t get an explanation the first time.  You may have spent hours preparing the prefect explanation of a particular concept, but some people just won’t get the point the first time around – and this is not their fault.  A trainer who really understands the core material will have no problems grasping someone’s mis-conception and explaining the concept another way (or several ways) until they get it.
  3. Personal practical experience is hugely valuable – not only covering what you should do, but having relevant examples of why you should do it like that will help convince people that the training you’re giving them is relevant.  I’ll come back to this as it’s such a key point.

Finding these people is not easy.  Some people are talented data analysts, some people are good teachers and the overlap between the two is small.  One out of the two really isn’t enough and will result in a poor experience.  Often it’s not the people who are leading the field who are the ideal trainers, they can be too close to the material, but instead  finding people who work with the right kind of data but in a broader way can be better.  One idea we’ve played with is pairing a scientist and a trainer to get the trainer to suck the useful information out of the scientist and format it in a way which is amenable to training.  This doesn’t mean the scientist isn’t needed when the training is actually presented – they definitely are – but doing it as a collaboration can be feasible.

Write pessimistic training

OK, so pessimistic is probably a bit harsh, maybe “realistic” would be better.  One of the biggest problems encountered by people who’ve been on a training course is that during the course they are given clean, santitised, well behaved examples which are ideal to illustrate a point.  When they go back to their labs and try to analyse their own data they quickly find that the nice pipeline they ran in the course doesn’t work so cleanly on their data and they quickly get stuck.  They might hit an error message, they might get no results, or in the worst case they get ridiculous results back, but don’t realise and waste their time following up something which isn’t real.

When I advocate pessimistic training what I mean is that when teaching a concept you can start of by explaining the theory and showing an idealised (standard) version of an analysis.  You can even use this as the example you actually run.  However – as you go through both the theory and the practical aspects of the course you should always try to explain how the idealised version can go wrong.  Really data is large, ugly and messy.  Anyone who has done any real analysis will know that there are some reasonably common ways in which things can go wrong, and they can usually spot them when they do the analysis themselves.  This practical experience is hugely useful to people you are teaching, so when teaching always try to cover:

  1. What the expected result is
  2. What are the common ways in which it can go wrong
  3. Ways to visualise / QC / interpret the data to spot the failure cases.  Ideally this should be accompanied by illustrations from real bad data so it’s clear that this isn’t just a theoretical concern
  4. How to deal with different failure modes – can they be rescued or just abandoned

This type of training emhasises my second point about who does your training – only people who routinely work with the right kind of tools and data and are critical in their evaluation can provide the kind of useful additional points which give practical strategies to students.

Update your training – or delete it

Science has an annoying habit of continually advancing.  Data gathering techniques change, data volumes expand, statistics improve, new artefacts and biases are discovered.  The most interesting and requested scientific training courses are on scientific areas where there isn’t a complete and final consensus on the optimal way to run the analysis, so good training should reflect this.

The first way in which this should happen is that courses should be very up-front about the strengths / weaknesses of the approaches they advocate, and should try to at least give a basic presentation of the alternatives.  For most software and training I am strongly in favour of opinions.  Trainers should have an opinion, but this should be based on knowledge and they should be aware of alternatives.  People on a course want practical guidelines, but they also want these put into context so that as the field develops they should be aware of the alternatives so that if your opinion changes over time then this shouldn’t surprise them.  You should also be up-front when you think that the approach that you’re advocating isn’t optimal and will likely be improved upon.  Sometimes there isn’t a solution you’d really recommend, but you can still show the best current option and point out why it’s not optimal.

To follow on from this, the other requirement is that courses should be updated and continually under review.  I don’t think we’ve run our RNA-Seq course more than twice from the same revision, because something has always changed (software updates, new software comes out, new library preps are invented etc), and I’m certain our methylation course has changed every time we’ve run it (it’s an immature, but interesting field).

I’d even go further and say that if you have a course which you are no longer running or actively updating that you should get rid of the material you put out for the course.  If you wrote an RNA-Seq course 2 years ago then you shouldn’t be passing the information in that course on to people who are generating samples today.  The field has moved on and there are almost certainly better analysis options than the ones you espoused then.  Even if you’re not running the course any more, if your material is publicly available then people will still find it, and they won’t check the date on it and will use it.  Training material probably ages even worse than software and there is an increasing amount of material available online, most of which isn’t being updated.  We should realise that our material has a shelf life and not leave it around when it starts to smell bad.

Give students a lifeline for after the course

One of the standard things we do with all of our courses is that at the end we give our our email addresses.  We also set aside time to answer questions from students.  These don’t often take very long – people get stuck on fairly simple things, but it can make a huge difference to their experience.  In our case we have a consultancy service, so that if people really decide they don’t want to do their analysis themselves we can give them an option, but even knowing sources of potential collaboration or forums to turn to is a valuable service.

Posted in Bioinformatics | 1 Comment »

Diagnosing problems in SeqMonk R scripts

May 7th, 2015

The major new feature in SeqMonk v0.30.x was the introduction of a bridge between SeqMonk and R, which allows the program to transparently run analyses in R from within the SeqMonk interface.  This functionality makes it very easy to run popular algorithms such as DESeq and EdgeR without having to write custom scripts or go through data export / import operations.

Unfortunately adding these functions also introduced a new source of failure into SeqMonk – ie that something could go wrong within the R process itself.  Whilst the error will be spotted by SeqMonk and can be reported through the usual crash reporter, because the error wasn’t within SeqMonk’s code base it’s often very hard for us to track down the exact cause from the details provided in the error report.

Providing better error reports

If you get a crash report which says something like “R failed to complete (exit 1)”, please do send it in as we want to know how often this is happening.  In order to allow us to diagnose this fully it would be useful if you could send us some more information as well though.

Specifically, when SeqMonk runs an R script it generates a temporary directory in your cache folder.  You will have set this up when you first installed the program.  If you can’t remember where you put it you can go to Edit > Preferences and look under the “Memory” tab at the location under “Cache reads to disk”.

When you have an R problem the temporary directory is left in the cache folder.  If you close SeqMonk you should see a folder left which is called something like “seqmonk_[some numbers]_tempd0”.  Inside this folder will be the data and script files for the R process which failed.  The exact file names will vary depending on what function you were running, but you should always find a file called script.r and another called script.r.Rout.  The file which can tell us what happened when R failed is in script.r.Rout, if you submit an R related crash report can you please locate this file and then email that to and we’ll do our best to figure out what when wrong in your specific case.

Tags: , ,
Posted in Uncategorized | Comments Off on Diagnosing problems in SeqMonk R scripts

Merging STDIN and STDOUT in a gridengine submission

July 8th, 2014

We recently hit a problem when trying to run a fairly simple script through grid engine.  The script used an internal redirection to merge together STDIN and STDOUT before doing a final transformation on the result.

A simplified version of what the script did is something like this:

echo hello 2>&1 | sed s/^/prefix:/


Which should put “prefix:” in front of any output from either STDIN or STDOUT.  Running this from the command line worked, but trying to run it through qsub failed.

$ echo "echo Hello 2>&1 | sed s/^/prefix:/" | qsub -cwd -V
Your job 201149 ("STDIN") has been submitted
$ ls -tr
$ more STDIN.e201149 
Ambiguous output redirect.


After many experiments we found the cause for this, which is actually pretty simple, but since it didn’t turn up in a google search I thought I’d document it.  The reason this type of redirection fails under qsub is that the default shell used by qsub is csh rather than bash, and the normal 2>&1 notation is a bash specific extension.  You can therefore make this work by either using csh redirection, or specifying bash as your preferred shell.  Either of the commands below would work:

echo "echo Hello |& sed s/^/prefix:/" | qsub -cwd -V
echo "echo Hello 2>&1 | sed s/^/prefix:/" | qsub -cwd -V -S /bin/bash



Tags: , ,
Posted in Bioinformatics, Computing | Comments Off on Merging STDIN and STDOUT in a gridengine submission

Review of FREEneck saxophone harness

February 8th, 2014

Update 6 Jan 2016: The people who make the freeneck harness have just released a modified version which addresses some of the problems I found in my original review – specifically the long rods down the back now come apart which should make it easier to alter the length, and also means that you can now splt the harness in two for storage so it should fit into the bell of your sax.  The pictures of the new model can be found here.


I’ve recently started to do a lot more playing on my baritone sax and had been finding that the neck strap I’d been playing on had been causing me to end up with mild but persistent back pain.  Nothing too serious but something I’d rather avoid if I could.  The strap I had been using was quite a cheap one I bought sight unseen when I first got the sax – it was one of those which went around your neck and under one arm.  This ends up putting quite a bit of lateral pressure on your neck so I’d been thinking about changing this for a while.

Since I wasn’t convinced that any conventional neck straps would be ideal for a bari I’d started looking at harnesses instead.  I’d borrowed a NeoTech harness from a friend to try out and although it make holding the bari a lot more comfortable, I didn’t like the restricted movement I got from the clip being tightly held against your chest.

I spent quite a while looking through various forums for experiences from other players who had looked at this before and the harness which sounded most like what I was after was the FREEneck harness from ERGOnomic systems (they’re big fans of MIXed CAse WORds apparently). The appeal of this system was that it was supposed to put no pressure on your neck at all, and it also had a conventional string and hook arrangement which should allow the same freedom as a conventional neck strap.


Unfortunately I couldn’t find a supplier for this harness in the UK, so I ended up ordering it from Thomann in Germany.  I’d ordered a couple of things from them before and they were as efficient this time as they had been before and the harness turned up a couple of days later.

The first thing thing to say about this harness is that it’s not cheap.  It’s not the most expensive harness I’d seen, but it’s pretty close at €105.  I was especially nervous as I’d have to buy this without being able to try it but decided to take the plunge with the option of selling it if I didn’t like it (you can’t buy them in the UK so someone must want it).

How it works

freeneck harnessThe harness itself comes in 3 parts.  The main bit is a normal looking neck strap string and hook from which projects a pair of long metal rods which extend down your back.  The end of these then connect into a pocket on a separate belt which clips around your waist.  Finally there is a pad which fits over the rods to provide some extra support.  The pad was listed on the box, but was actually missing from the harness I received. I’ve not yet contacted the seller to get a replacement sent out – I probably should but in all honesty I’m not sure I’d use it even if I had it.

Setting up the harness

freeneck barsThe harness comes with a single sheet of instructions which include both an English and German translation.  They give a very brief overview of how to set the harness up but although they say what you can adjust on the harness they don’t give a very clear impression of how it should sit and feel when it’s set up correctly so you’ll have to work this out for yourself.  The basic idea for the system is that the neck strap part of the harness doesn’t actually touch the back of your neck – instead it sits a couple of centimeters behind.  When you attach a sax to the hook the weight pulls on the neck strap but this is prevented from coming forward by the rods which run down your back.  These in turn sit in the pocket on the belt so the whole system pivots around your back so the weight is spread generally over your back and hips.

The major adjustment you need to make is to the length of the metal rods.  These come in two parts, an upper and lower section, which are separated by long threaded screws which you can lengthen or shorten to get the rods positioned correctly.  The screws are shipped set to a very short setting (to make the harness smaller to transport) so you’ll need to unscrew them a long way if you’re tall like me.  I was worried that there wasn’t going to be enough thread to extend to my height but the screws are actually really long and there’s plenty of slack in them.  One annoying design feature is that to extend the screws you have to completely detach the upper and lower parts of the harness.  This means that you end up spending quite a lot of time unscrewing and re-connecting the two halves to get the length correct.  This is annoying, but you’ll only do it once so it’s not such a big deal.  The other adjustment you can make is that you can bend the top part of the bars to fit the curve of your back.  Be careful when doing this though – I fitted mine so it was comfortable and then found that when I hung a sax off the hook the shape changed so that it was pressing on my neck, and I had to adjust it back again.  I found that the final setup I ended up with had the neck strap being a few centimeters off my neck with no sax attached, and almost (but not quite) touching my neck when the sax was present.

Playing comfort

freeneck hookGenerally the harness is pretty comfortable to play on.  You need to make sure that the pocket on the belt sits right in the middle of your back so that the bars don’t press unevenly on your back.  You’ll also need to make sure the belt sits low enough that it doesn’t end up digging into your stomach which is both uncomfortable and can restrict your breathing.  The belt itself needs to be quite tight so that the bars can’t move out much from your back and it’s not always easy to stop it riding up over your hips.

The harness doesn’t put any pressure on your neck, and your head is free to move around naturally.  However you feel pressure down your back, especially between your shoulder blades where the bars pivot between the belt and the neck strap.  This isn’t uncomfortable as long as your back is reasonably straight, but if you start to slouch then you’ll feel the bars starting to dig in – this is a harness which encourages good posture!

There are also a couple of other places where the harness puts pressure.  The most noticeable is that the strings from the neck strap press down onto the front of your shoulders and can dig into your collar bones.  The two strings are separated by a long plastic tube, which stops the strings coming together around your neck, but also allows the plastic tube to dig into your chest.  Both of these issues are only really present when you’re holding, but not playing the sax and once you move it into a playing position they sort themselves out and the harness is actually very comfortable.

The big advantage of this system over other harnesses is that it uses a conventional string and hook arrangement so you have a lot of flexibility to move your sax around to your preferred playing position.  There is normal height adjustment on the string and this has a lot of slack in it so I could comfortably use the same harness on all of my saxes from a short alto to a long baritone setting.

One thing I did find is that it seems that this system works best when you’re playing standing up.  This is probably where a harness is most required, and I’ve played for extended periods whist standing with no problems.  If you’re sitting down then it’s not quite as good.  The belt is harder to keep in position and tends to ride up more when can cause the other parts of the harness to move out of alignment.  It’s not uncomfortable to use sitting down, but if you play a lot standing up then it works really well.

Build Quality and Practicality

freeneck wearingThe harness seems to be very well built.  The materials and quality of finish seem to be very good.  The only concern I had was the hook – this is an all plastic variety and although I’ve had no problems and I’m sure it’s up to the job it doesn’t look very solid.  The hook on my harness was quite stiff and the opening when you pull back the pin isn’t very big so it might not be the best if you have to do a lot of quick switches of instruments, but this might free up with more use.

One big potential drawback with this harness is its size.  It’s pretty big – it has to be at least the length of your back, and it doesn’t collapse for transport or storage unless you want to completely unscrew the bars each time (which you won’t).  There’s also a rigid panel at the back of the neck strap which limits how small you can fold this part.  What this means is that you’re not going to be able to keep this in your sax case.  I couldn’t even get it to fit down the bell of my bari (the bars would fit, but the neck strap was too wide to sit in the end of the bell).  I’ve got an enormous gig bag which I take around and it goes in there (just) but if you’re used to just carrying a sax case around then this is going to end up being an extra piece of luggage.

Once you’re wearing the harness you can’t really see it from the front.  You should be able to put it on under a suit jacket or DJ and you can take the bars off over your head so you don’t have to commit to wearing the harness between sets if you’re doing a long gig.  If you’re wearing it on top of your normal clothes then you can see it running down your back, but it’s not at all unsightly and I’d be happy to wear it like this.


Overall I like the FREEneck harness.  It’s a lot more comfortable than the strap I had before and my back hasn’t been hurting since I’ve been using it.  I’d be quite happy to play for extended periods on this, even if I had to be standing.  The major downsides of the harness are the cost and the practicality of carrying the thing around, but it could be a good solution for a lot of players.

Tags: ,
Posted in Music | Comments Off on Review of FREEneck saxophone harness

Fast subset selection by row name in R

December 5th, 2013


One of the best features about R is the simple way you can use a number of different strategies to create subsets of large data tables.  The basic selection mechanisms you have are that you can subset a data frame by providing:

  1. A set of column or row indices
  2. A set of boolean values to specify whether individual columns or rows can be included
  3. A set of row or column names to include in the filtered set

Mixing these basic mechanisms with the plethora of other functions R provides to generate these data types gives you a simple way to create powerful filters to reshape your data.

The Problem

Anyone who has used R for a significant amount of time will be painfully aware that although it’s syntax can be very powerful, the implementation of some of its functions is often not very efficient.  I had always assumed that the core selection and filtering tools would be very efficient though.  It turns out I was wrong.

Specifically I came across a case where I wanted to make a subselection of rows from a very large data frame (~500k rows) based on the row names.  I therefore tried what seemed to be the obvious solution, which was to do a rowname based lookup.  This it turns out is very slow.  You can see the problem using this simple example below which selects 2500 random rows from a 1 million row data frame.  On my machine the last line takes around 20 seconds to complete, and the completion time seems to scale linearly with the number of random rows selected. <- data.frame(values1=rnorm(1000000),

rownames( <- paste("row",1:1000000,sep="")

sample(row.names(,2500) -> random.rows[random.rows,] ->

The Solution

The limiting step in the slow code above seems to be the actual lookup of the list of names against the rownames in the data frame.  Lookups by index position do not appear to be slow, so you can work around this by doing the lookups yourself manually before running the filter. You can easily do this with the match() function which will take in a vector of strings and will give you the index position of each one in a second vector.

This means we can change the filtering part of the original script to be:[match(random.rows,rownames(,] ->

..which completes in less than a second.  In this case the lookup in the data frame is done based on the indices returned by match so the match function and the lookup within match appears to be implemented much more efficiently than the core code in the data frame filtering.

There may be some added functionality in the core data frame lookup code of which I’m not aware but it seems odd (and bad) that the natural way to do this kind of lookup is so inefficient.  I’d have expected that the core lookup would have been doing what match does behind the scenes and should be able to return a result just as efficiently as with our modified solution.  Hopefully this can be addressed in future, but at least for now there is a fairly simple work round which can make things much quicker.

Posted in Bioinformatics, Computing | Comments Off on Fast subset selection by row name in R

Why is java not found on the command line after I’ve installed it?

September 9th, 2013

The Problem

A common problem we have reported from users of SeqMonk is that when they try to launch the program on a Windows system they get an error message saying that java could not be found, even though they have installed this and they can show that it works fine in their browser.

When SeqMonk looks for java it’s actually testing whether the command shell has been configured such that running “java -version” returns a valid java version identifier.  If you run this on a system with java correctly installed you should get something like this:

C:\Users\andrewss>java -version
java version “1.7.0_13”
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

However on a machine where java has not been correctly installed to the command line you’ll get something like this:

C:\Users\andrewss>java -version
‘java’ is not recognized as an internal or external command,
operable program or batch file.

The Reason

If you’ve installed java and the installation completed successfully with no errors, but you still can’t access the java program on the command line then the most likely cause is that there is a mismatch between the version of java you have and your operating system.  Most PCs these days ship with a 64-bit operating system, and this is able to run both 64-bit and 32-bit programs.  Java also comes in 64-bit and 32-bit versions, and both of these can run on 64-bit windows (32-bit windows can only run 32-bit java).

If you are using java to run local programs then you generally want to run the 64-bit version of java since it gives you access to more of your system’s memory and includes other efficiencies.  For historical reasons though Oracle don’t link to the 64-bit verison of java on their default download page, so it’s very easy to be led into installing the 32 bit version of java.  For some applications this won’t matter, but as secondary consequence of this is that installing the 32-bit version of java on a 64-bit windows system will not install the java program onto the command line, so although it will work in a browser it won’t work in your command shell and programs such as SeqMonk won’t find it.

The Solution

The quick fix for this problem is to install the 64 bit version of java.  You can install this alongside the 32 bit version and the can coexist happily so anything which does need the 32 bit version (some browsers still might) can still use it, but the 64 bit version will then work on the command line and you will be able to access all of your system’s memory in the java programs you run.

The path you need to take to get this to work is shown below:

Java Front ScreenFirst, go to, and click on the link which says “Free Java Download”







See all Java DownloadsOn the next screen ignore the link which says “Agree and Start Free Download”, but instead look for, and click, the smaller link below which says “See all Java downloads”





Java 64 DownloadFrom the list on the next page you should find and click the option which says “Windows Offline (64-bit)”.  This will download the correct version of java which, once installed, should allow you to run command line java programs correctly.



Tags: , ,
Posted in Computing | 1 Comment »

A new way to look at duplication in FastQC v0.11

September 3rd, 2013


After a long gestation we’ll be releasing a new version of FastQC in the near future to address some of the common problems and confusions we’ve encountered in the current version.  I’ll write more about this in future posts but wanted to start with the most common complaint, that the duplicate sequence plot was difficult to understand and didn’t really show the information people wanted to see.  I wrote a post previously to explain how the current module works but in the new release this module will be significantly restructured to give a more comprehensive, relevant and hopefully intuitive view of the duplication within a library.

What was wrong with the old version?

A few things actually…

  1. To keep memory usage sane we had to sample the library rather than count all sequences.  This meant that what we showed in the plot was the relative number of sequences within the sample set with different duplication levels.  Because there wasn’t enough space to show all possible duplication levels we had to set a limit at 10, but this meant that the plot couldn’t accurately reflect the level of duplication just by looking at the graph, because you couldn’t tell whether the sequences in the 10+ field occurred 11 or 11,000 times.
  2. Because we hadn’t found a nice way to extrapolate from the subset of data we monitored to the whole dataset, the y axis for the plot expressed duplication levels relative to unique sequences.  This was easy to do but was difficult to interpret, and was pretty misleading in cases where there were virtually no unique sequences in the library.
  3. We included an overall duplication value at the top of the plot, which was a percentage value, but again this value was difficult to interpret since the value shown was simply the percentage of sequences in the library which were not unique.  What this meant was that a library containing the same sequence in every read, and a library with diverse sequences which were all present exactly twice would both have a duplication level of 100%, which wasn’t very good.

What have we done to fix this?

The new version of the plot still retains some of the properties (and limitations) of the old method.  Basically, the way in which the initial count data is collected hasn’t changed.  We still sample the library so that we don’t require huge amounts of memory, and we still only look at the first 50bp of any library over 75bp so we don’t get messed up by base calling error.  It’s the way we treat and represent this data which has changed.

The inspiration for the new version of the module mainly revolves around an improved way of extrapolating from our limited dataset to the rest of the library so that we can accurately predict what we would have seen if we had actually measured the whole library.  The maths for this was kindly worked out and written up for us by Brian Howard of, to whom we’re extremely grateful, and we’ve adapted his approach in the new version.

The new plot has three major changes to the old one.

  1. The number of x-axis categories has been increased to expand the 10+ category into several groups to give you a clearer picture of the true level of duplication in your sample.
  2. The y-axis of the plot now represents a percentage of the total library, so all values are directly comparable and much easier to understand.
  3. The plot now contains two lines instead of one.  Both of these profiles show the proportion of the library which comes from sequences with different levels of duplication.  The two traces simply show the percentages of the raw and deduplicated libraries.  By looking at the two traces you can quickly see how the library is divided between different duplication levels, and by comparing the two traces you can easily see whether the 10+ category is composed of a small number of very highly duplicated sequences, or a much larger number of moderately duplicated sequences.

In addition to the changes to the plots we have also changed the headline duplication level value.  This is still shown as a percentage, but the value now shows what percentage of the library would remain if you deduplicated it to keep only one copy of every different sequence, which fits a lot better with how people actually think about duplication.

So is it now perfect?


OK, no.  But it’s much better.  There are however some problems which aren’t solved (and may not be solvable).

  1.  The sampling of the library is still based on the first N different sequences we see (where N is currently 200,000 but in the new scheme it doesn’t really matter too much what it is).  This approach relies of seeing the sequences in a random order which is fine for fastq files, but you can still get misleading answers if you feed a sorted BAM file into the program.  This isn’t really something we can solve since the only guaranteed solution involves holding every possible sequence in memory, which is bad.
  2. When setting warn / error limits we have to make assumptions about the library which aren’t really true.  We assume that we’re sampling from an infinitely sized genome, and that no duplication is always better.  In reality you might be sampling very deeply from a small genome where you’d expect a certain level of duplication, and where lower duplication levels might be just as worrying as high levels.  Alternatively you might be sampling in a way where even representation isn’t expected (such as RNA-seq libraries where representation reflects expression levels).  In theory we could address the first case if we were given a genome size, but the second case is very difficult unless we make some big assumptions about the expected distribution of expression levels.  This might be more of a problem if library sizes got much bigger, but it’s not too bad at the moment.
  3. All of the duplication detection is based on an exact match between two sequences and even a single base change is enough to call the sequences as different.  In libraries of poor quality, which contain a larger percentage of sequencing errors, the duplication level will be under-reported as the sequencing errors will be incorrectly interpreted as diversity in the library.  This might be theoretically solvable with fuzzy matching, or a quality aware algorithm, but both of these would be significantly more computationally intensive than the current approach and would slow down the whole program – something we’re trying very hard not to do.

Can I see an example?

Of course.  Let’s start with a nice library and work our way down to something nasty.


Example 1 – Very diverse library

low_duplicationHere we have a library with low duplication.  In both the raw and deduplicated versions of the library the vast majority of reads come from sequences which only occur once within the library.  This suggests that the library sampling isn’t anywhere close to saturation and that you have a diverse population.


Example 2 – RNA-Seq

rnaseq_duplicationThis is a good RNA-Seq library, but actually represents the kind of profile you might see in any enriched experiment.  In general FastQC makes the assumption that you expect to be looking at a diverse library with a roughly even representation of all sequences.  In enriched libraries this isn’t going to be true and there are certain sequences (highly expressed genes or enriched peaks in a ChIP) where you want the sequence to occur more frequently.  Above a certain read density the only way to place more reads into a region is via duplication so we might not expect to have a properly diverse library.

What we can see in this case is that we have lots of sequences with low duplication levels (2-9 copies) which probably cover most of the ‘normal’ genes we’ve measured in the library.  Then there are a bunch of sequences with moderate duplication (10 – 100 copies) which may cover things like rRNA and some repeat sequences.  There are very few highly duplicated sequences which might be more indicative of a technical problem with the library.

SeqMonk will probably throw a warning or error from these libraries since fairly large chunks of the library will be lost were it to be deduplicated, but the profile seen here actually isn’t too bad for RNA-Seq.  If anyone has a suggestion for a better generic metric to use than the amount of loss upon deduplication then we’d be happy to hear suggestions.


Example 3 – PhiX

phix_duplicationIn a PhiX control lane you expect that all of the PhiX sequences will be present thousands of times, so we don’t expect this to look pretty.  From the blue line we can see that in the raw library the majority of the library is composed of sequences which are present 1000 – 5000 times, with very few sequences from the other categories.

Interestingly we can also see that when we deduplicate the library that the vast majority of sequences in the deduplicated set would come from reads which were present only once in the original library, suggesting that the deduplicated set would be dominated by sequencing errors or contaminants.  This is maybe a good indication of why deduplicating your data as a matter of course is a really bad idea.


Example 4 – Horrible PCR duplication

high_duplicationThis was about the worst example I could find.  This was an RNA-Seq library which had some nasty contamination and PCR duplication.  you can see that in the raw data the majority of reads come from sequences which are present tens of thousands of times in the library, but that there is also strong representation from reads which are present hundreds or thousands of times, suggesting that the duplication is widespreads.

In this case deduplicating may help the experiment, but it causes the loss of 93% of the original data so it’s a pretty drastic step to take.





Tags: , , ,
Posted in Bioinformatics | 3 Comments »

Generating R reports with vector images from markdown with knitr

June 21st, 2013


One really nice addition to a standard R environment is the ability to create reports which combine R code, comments and embedded graphical output.  The original mechanism for doing this was Sweave, but more recently a second system called knitr has emerged which seems to be more flexible, and this is what I’ve been using

Basic Knitr Use

One of the aspects I particularly liked about knitr was that you could use simple markdown syntax to create your reports rather than having to use Latex, which I’ve never really got to grips with.  It’s really simple to create documents which mix comments, code and graphics.  Here is a simple example:


Simple test markdown for use with knitr


If you use the standard “Knit HTML” function built into Rstudio then this will generate an HTML document with bitmap PNG images actually embedded within the document source as base64 encoded data.  This is fine, but what I often wanted was the ability to generate vector images so that they could later be extracted for publication without having to rerun the script.

Having done some searches I found lots of different bits of information which described different ways to convert the various files which knitr can produce, but with wildly varying levels of success, so having figured out some nice ways to make this work I thought I’d document them.

Using Vector Images directly in your HTML

The simplest way to get vector images into your document is simply to change the type of image you use to a vector based image.  Most web browsers these days can understand SVG, and this is a standard output option for R so you can simply change your code to:


Simple test markdown for use with knitr

```{r dev='svg'}

..and you’ll now get an HTML document with embedded SVG graphics.

I did however find a problem with this.  Some of the graphs I was drawing contained a reasonably large number of points and the SVG files created were also very large which made them unwieldy and slow to open.  This is however the simplest way to do your conversion.

Using PDF as an output format

An alternative method to use would be to generate a PDF document as your output format.  Rs PDF output seems to be better structured than its SVG and PDF images for plots which were very large in SVG were tiny in their PDF equivalents.

Knitr itself won’t natively create PDFs and the easiest way I found to do the conversion was to use a program called pandoc.  Pandoc can read in files in a variety of formats and convert them into PDFs using a command as simple as:

pandoc -s input.html -o output.pdf

I found this suggestion in a few places for converting the HTML output from knitr into PDF, but there are a couple of problems with this approach.

  1. The default HTML output from Rstudio uses html encoding and pandoc can’t understand this.  In order to the conversion you need to tell the program to create xhtml encoded output.  You can do this by manually running the conversion in an R session

    which will create an xhtml version of your Rmd file which can be converted.

  2. The second problem is the pandoc will embed whichever figures you created in your initial knitr run and try to put these into the PDF.  This is fine if you used PNG files, but it can’t handle SVG files so you’ll get an error if you try to use these.  The solution to this is either
    1. Change your device type in your file to
      {r dev='pdf'}

      This will produce an html file you can’t actually open in a browser as the PDF links won’t be rendered, but pandoc will recognise them and make a nice PDF from them.

    2. Change your device type to PDF and run knitr, but then don’t generate HTML but instead run pandoc on the .md file created by knitr.  This will actually produce a more nicely formatted PDF than going from the HTML.  One problem you will find with this though is that you get odd figure legends under your plots (Figure 1: plot of chunk unnamed-chunk-1 for example).  You can fix this by putting in a proper figure caption on the R code which generates the plot
      ```{r dev='pdf', fig.cap="My first figure"}

Using these relatively simple steps you should be able to turn an R script with a minimal amount of additional formatting into a professional looking document with publication quality figures embedded into it.

Tags: , ,
Posted in Bioinformatics, Computing | Comments Off on Generating R reports with vector images from markdown with knitr

Syncing calendars between iOS and android

October 4th, 2012


Having been happily using a combination of OSX and iOS for a few years now I’ve recently expanded my device collection with a google nexus 7 tablet.  I’ve been using a separate IMAP based email account for a while so this was easy to set up on the new device, but I also wanted my calendars to work.

Up until this point I’d been using the iTunes based syncing to manage my calendars between my local copy of iCal and the calendars on my iPod.  This was somewhat inconvenient since the calendars would occasionally be out of sync, but for the most part it worked well.

Now that I had an android device I wanted to be able to both view my events on my computer, iPod and tablet, and be able to create new events on any of these devices and have them sync to the others.

I read a few different howtos for how to achieve this, but thought it would be worth recording how I made it work in the end since no single guide.

Moving data to google calendar

Since all of the iCal events I had were stored in a local database on my mac, and since I didn’t intend physically connecting my tablet to my computer to sync I realised that I’d need to move my events into google calendar and use that as the central repository.  In my existing iCal I’d set up a few different calendars for work / home / other event types, but although google calendar can be made to support this, it’s really designed to have a single calendar within which you can customise the colour with which different events are shown.  I therefore decided to merge my various calendars into a single google calendar.

It’s possible to connect a google calendar to iCal using iCal account preferences and adding in your google credentials, however this enables sync in only one direction (pulled down from google), none of your existing events are synchronised to google.  You can send individual events to your google calendar by dragging them into the google entry in the calendar list, but this won’t work for large numbers of events.

The easiest way to transfer all of the events was to export each of my calendars as an ics file by selecting them in turn in the iCal calendar list, and then selecting File > Export > Export.  One you have the collection of iCal files you can go to the google calendar web page.  Under the “Other calendars” drop down there is an option to import events from an iCal file.  If you do this for each of your files then all of your existing events will be copied into google calendar.

Adding new events via iCal

Once all of your existing iCal events have been imported to google calendar you can then delete the local calendars from your iCal preferences. You can add your google calendar by going to iCal > Preferences > Accounts > Pressing + to add an account > Select ‘Google’ for the account type and then entering your google credentials.

To ensure that new events go into this calendar go to the “General” section of your preferences and select this calendar as your Default Calendar.

Syncing calendars to an IOS device

My first attempt at putting the google calendar information on my iPod was to use the iTunes sync options and selecting the google calendar I’d added to iCal.  This did indeed cause all of the events to show up in my IOS iCal, but new events could only be added to the local calendar and wouldn’t therefore have passed through to my other devices.

The correct way to do this was to add my Gmail account as an account in the general IOS settings.  Under Settings > Mail, Contacts, Calendars I chose “Add account” and then selected Gmail – some guides (including ones from google) say to use Exchange or CalDav, but only the Gmail option worked smoothly.  You simply need to enter your gmail credentials, and then once these are verified you can turn off email sync if you only want to import calendar events.  Once this was done the calendar showed all of my events, and new events which were added synchronised over to my computer and tablet the next time I was attached to a network.

I now therefore have the same calendar available on my computer, tablet and iPod with the ability to make changes on any device.  As a bonus I can also choose to share the calendar with the rest of my family so they can see my events.

Tags: , , , ,
Posted in Computing, Technology | Comments Off on Syncing calendars between iOS and android

Should you buy a nanopore sequencer?

February 18th, 2012

This morning twitter is awash with posts discussing the newly announced nanopore sequencers from Oxford Nanopore. Speculation has been rife for some time about the potential specifications of the first sequencers to be produced by the company, and it certainly appears that the company have fulfilled the expectations placed upon them.

I’m not going to go through the details of the two sequencers announced – others have done a good job of listing the specs here and here, but basically you have machines with either 512, or 2000 nanopores capable of sequencing fragments up to 40kb (but probably several kb routinely) and error rates of around 4%, mostly indels, and with promises of imminent improvements to bring this value down – all at a per-base cost similar to the best of the existing platforms.

Reading through the initial comments about this new platform my first reaction was that we have to get one (or more) of these, but after calming down and thinking about this for a bit I thought I’d have a stab at going through the use cases where this type of sequencer really makes sense.

De-dovo sequence assembly: Oh yes!

The one place where this new platform is a complete no-brainer is if you’re assembling de-novo genome sequence. Whilst Illumina sequencers can give you good coverage depth from paired end reads of around 100bp there are always regions of the genome whose repetitive nature mean that this will not provide enough context to allow a contiguous assembly. Currently you either need to start creating mate-pair libraries, which are notoriously difficult to produce, or you need to get your floor reinforced and stump up a huge amount of cash for a PacBio.  The prospect of generating reads of 10kb+ with a simple library prep should be music to your ears, and a 4% error rate with short indels should be easy to work around with a mixed assembly.

Metagenomics: Oh yes!

In the same vein as de-novo assembly the propect of longer reads should make metagenomic studies much easier.  Getting more context for your reads should allow you to distinguish between related species much more easily and assembly of mixed bacterial populations should be possible even with the slightly more limited throughput of these sequencers.

Genotyping: Possibly

I guess the main advantages of the nanopore platform for genotyping is the speed with which it can generate data.  Data collection begins almost immediately upon addition of the sample, and real-time monitoring of the data output means that you can immediately stop the run once you have observed all of the variants you are looking for.  The long read lengths should allow the illucidation of even the most complex genome re-arrangements.  The somewhat high error rates may be problematic, but if these really are mostly indels, then SNP calling might still be practical.  The per-base cost means that the current sequencers aren’t yet practical for real time diagnostic use, but future developments on this platform would seem to make this a possibility.

Epigentics: Possibly

One of the promises of nanopore sequencing was the ability to distinguish modified bases during the base calling process.  PacBio have shown that they are able to distinguish hydrox-methyl-cytosine from cytosine, and suggest that identification of methyl-cytosine is theoretically possible. In the reports I’ve seen so far Oxford Nanopore haven’t said anything concrete yet about the ability of their platform to call modified bases, but if this proves to be possible and reliable then this will become an essential bit of kit for labs working on epigentics.  The ability not only to directly read modifications directly, but to be able to put these in the context of a multi-kb fragment is truly exciting.  The addition of a hairpin structure at the end of a fragment would also allow these sequencers to read both strands of the same fragment, again providing contextual information which has so far been lacking.

It’s possible that the nanopore sequencers may still be of use to epigenetics even without the ability to read modifications directly.  Genome wide bisulphite sequencing is already being undertaken on Illumina sequencers, and should be possible on nanopore sequencers, however the bisulphite treatment itself is very harsh, and fragments the DNA sample as it modifies it, so the super-length reads able to be obtained from normal genomic DNA may be elusive once it has been modified.

 ChIP-Seq: Not really

The power of ChIP-Seq comes from the number of observations you make, not the length of those observations.  The nanopore sequencers seem to be best suited to sequencing fewer, longer fragments which would not be an advantage for ChIP-Seq.  There seems no obvious reason why short insert libraries couldn’t be sequenced on a nanopore platform, but at the moment we know very little about the overhead of starting a new sequence on the same nanopore so this may be feasible, but longer read lengths would simply reduce the resolution of the ChIP assay.  For some applications it might be interesting to monitor ChIP results in real time, and be able to halt a run once clear peaks had emerged, but in the short term I can’t see this being a good option for this type of experiment.

RNA-Seq: It depends

As with ChIP-Seq, much of the power of RNA-Seq comes from the number of observations which have been made.  To make a reasonable measurement of low-expressed transcripts then very large numbers of sequences must be generated, and the existing short read platforms will likely have an advantage in this regard for some time, so for simple quantitation of transcripts the nanopore platform may not offer huge advantages.  Where the longer read lengths of the nanopore sequencers will  be of use will be in the elucidation and quantitation of splice variants.  Current RNA-Seq protocols provide coverage of a very small part of the transcript and often do not provide enough context to determine exactly which splice variant the reads came from.  Performing relative quantitation of the splice variants of a gene is therefore not a simple process.  Longer reads from a nanopore sequencer could cover the whole length of a transcript removing all doubt about exactly which variant it was.  Whether this proves to be a useful tool for expression quantitation will depend on whether the platform is able to generate an unbiased selection of reads (or a selection with a well understood bias) to allow accurate quantitation.


So do I think we should get one of these sequencers?  Heck yes! For $900 a piece for the MinIon there’s absolutely no excuse for everyone not to get one and start playing with it to see what it can do. For much of the workload we currently have it may be that this platform isn’t going to revolutionise what we do, but if nothing else it will hopefully spur on the existing manufacturers to push forward the development of their existing platforms.  In any case the scientists win.  We live in exciting times…

Posted in Bioinformatics, Technology | Comments Off on Should you buy a nanopore sequencer?