Merging STDERR and STDOUT in a gridengine submission

July 8th, 2014

We recently hit a problem when trying to run a fairly simple script through grid engine.  The script used an internal redirection to merge together STDERR and STDOUT before doing a final transformation on the result.

A simplified version of what the script did is something like this:

echo hello 2>&1 | sed s/^/prefix:/

This should put “prefix:” in front of any output sent to either STDOUT or STDERR.  Running it from the command line worked, but trying to run it through qsub failed.

$ echo "echo Hello 2>&1 | sed s/^/prefix:/" | qsub -cwd -V
Your job 201149 ("STDIN") has been submitted
$ ls -tr
STDIN.o201149
STDIN.e201149
$ more STDIN.e201149 
Ambiguous output redirect.

After many experiments we found the cause, which is actually pretty simple, but since it didn’t turn up in a google search I thought I’d document it.  The reason this type of redirection fails under qsub is that the default shell used by qsub is csh rather than bash, and the 2>&1 notation is Bourne shell syntax which csh doesn’t understand.  You can therefore make this work either by using csh redirection, or by specifying bash as your preferred shell.  Either of the commands below would work:

echo "echo Hello |& sed s/^/prefix:/" | qsub -cwd -V
echo "echo Hello 2>&1 | sed s/^/prefix:/" | qsub -cwd -V -S /bin/bash
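
If you’re submitting a job script rather than piping the command into qsub you can make the same choice of shell inside the script itself using embedded grid engine directives.  The sketch below assumes a hypothetical script called merge_test.sh:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V

# STDERR is merged into STDOUT before the final transformation
echo Hello 2>&1 | sed s/^/prefix:/

You would then submit this with a plain "qsub merge_test.sh".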

Posted in Bioinformatics, Computing


Review of FREEneck saxophone harness

February 8th, 2014

Introduction

I’ve recently started to do a lot more playing on my baritone sax and had been finding that the neck strap I’d been using was giving me mild but persistent back pain.  Nothing too serious, but something I’d rather avoid if I could.  The strap I had was quite a cheap one I bought sight unseen when I first got the sax – it was one of those which goes around your neck and under one arm.  This ends up putting quite a bit of lateral pressure on your neck, so I’d been thinking about changing it for a while.

Since I wasn’t convinced that any conventional neck straps would be ideal for a bari I’d started looking at harnesses instead.  I’d borrowed a NeoTech harness from a friend to try out and although it made holding the bari a lot more comfortable, I didn’t like the restricted movement caused by the clip being held tightly against your chest.

I spent quite a while looking through various forums for experiences from other players who had looked at this before and the harness which sounded most like what I was after was the FREEneck harness from ERGOnomic systems (they’re big fans of MIXed CAse WORds apparently). The appeal of this system was that it was supposed to put no pressure on your neck at all, and it also had a conventional string and hook arrangement which should allow the same freedom as a conventional neck strap.

Purchasing

Unfortunately I couldn’t find a supplier for this harness in the UK, so I ended up ordering it from Thomann in Germany.  I’d ordered a couple of things from them before and they were just as efficient this time: the harness turned up a couple of days later.

The first thing to say about this harness is that it’s not cheap.  It’s not the most expensive harness I’d seen, but it’s pretty close at €105.  I was especially nervous as I’d have to buy it without being able to try it, but decided to take the plunge with the option of selling it on if I didn’t like it (you can’t buy them in the UK so someone must want it).

How it works

The harness itself comes in 3 parts.  The main bit is a normal looking neck strap string and hook from which projects a pair of long metal rods which extend down your back.  The ends of these then connect into a pocket on a separate belt which clips around your waist.  Finally there is a pad which fits over the rods to provide some extra support.  The pad was listed on the box, but was actually missing from the harness I received.  I’ve not yet contacted the seller to get a replacement sent out – I probably should, but in all honesty I’m not sure I’d use it even if I had it.

Setting up the harness

The harness comes with a single sheet of instructions in both English and German.  They give a very brief overview of how to set the harness up, but although they say what you can adjust they don’t give a very clear impression of how the harness should sit and feel when it’s set up correctly, so you’ll have to work this out for yourself.  The basic idea of the system is that the neck strap part of the harness doesn’t actually touch the back of your neck – instead it sits a couple of centimeters behind it.  When you attach a sax to the hook the weight pulls on the neck strap, but this is prevented from coming forward by the rods which run down your back.  These in turn sit in the pocket on the belt, so the whole system pivots around your back and the weight is spread over your back and hips.

The major adjustment you need to make is to the length of the metal rods.  These come in two parts, an upper and lower section, which are separated by long threaded screws which you can lengthen or shorten to get the rods positioned correctly.  The screws are shipped set to a very short setting (to make the harness smaller to transport) so you’ll need to unscrew them a long way if you’re tall like me.  I was worried that there wasn’t going to be enough thread to extend to my height but the screws are actually really long and there’s plenty of slack in them.  One annoying design feature is that to extend the screws you have to completely detach the upper and lower parts of the harness.  This means that you end up spending quite a lot of time unscrewing and re-connecting the two halves to get the length correct.  This is annoying, but you’ll only do it once so it’s not such a big deal.  The other adjustment you can make is that you can bend the top part of the bars to fit the curve of your back.  Be careful when doing this though – I fitted mine so it was comfortable and then found that when I hung a sax off the hook the shape changed so that it was pressing on my neck, and I had to adjust it back again.  I found that the final setup I ended up with had the neck strap being a few centimeters off my neck with no sax attached, and almost (but not quite) touching my neck when the sax was present.

Playing comfort

Generally the harness is pretty comfortable to play on.  You need to make sure that the pocket on the belt sits right in the middle of your back so that the bars don’t press on it unevenly.  You’ll also need to make sure the belt sits low enough that it doesn’t end up digging into your stomach, which is both uncomfortable and can restrict your breathing.  The belt itself needs to be quite tight so that the bars can’t move out much from your back, and it’s not always easy to stop it riding up over your hips.

The harness doesn’t put any pressure on your neck, and your head is free to move around naturally.  However you feel pressure down your back, especially between your shoulder blades where the bars pivot between the belt and the neck strap.  This isn’t uncomfortable as long as your back is reasonably straight, but if you start to slouch then you’ll feel the bars starting to dig in – this is a harness which encourages good posture!

There are also a couple of other places where the harness puts pressure.  The most noticeable is that the strings from the neck strap press down onto the front of your shoulders and can dig into your collar bones.  The two strings are separated by a long plastic tube, which stops the strings coming together around your neck, but also allows the tube to dig into your chest.  Both of these issues are only really present when you’re holding, but not playing, the sax; once you move it into a playing position they sort themselves out and the harness is actually very comfortable.

The big advantage of this system over other harnesses is that it uses a conventional string and hook arrangement so you have a lot of flexibility to move your sax around to your preferred playing position.  There is normal height adjustment on the string and this has a lot of slack in it so I could comfortably use the same harness on all of my saxes from a short alto to a long baritone setting.

One thing I did find is that this system seems to work best when you’re playing standing up.  This is probably where a harness is most needed, and I’ve played for extended periods whilst standing with no problems.  If you’re sitting down then it’s not quite as good.  The belt is harder to keep in position and tends to ride up more, which can cause the other parts of the harness to move out of alignment.  It’s not uncomfortable to use sitting down, but if you play a lot standing up then it works really well.

Build Quality and Practicality

The harness seems to be very well built, with good materials and a good quality of finish.  The only concern I had was the hook – this is an all plastic variety and although I’ve had no problems and I’m sure it’s up to the job, it doesn’t look very solid.  The hook on my harness was quite stiff and the opening when you pull back the pin isn’t very big, so it might not be the best if you have to do a lot of quick switches between instruments, but this might free up with more use.

One big potential drawback with this harness is its size.  It’s pretty big – it has to be at least the length of your back, and it doesn’t collapse for transport or storage unless you want to completely unscrew the bars each time (which you won’t).  There’s also a rigid panel at the back of the neck strap which limits how small you can fold this part.  What this means is that you’re not going to be able to keep this in your sax case.  I couldn’t even get it to fit down the bell of my bari (the bars would fit, but the neck strap was too wide to sit in the end of the bell).  I’ve got an enormous gig bag which I take around and it goes in there (just) but if you’re used to just carrying a sax case around then this is going to end up being an extra piece of luggage.

Once you’re wearing the harness you can’t really see it from the front.  You should be able to put it on under a suit jacket or DJ and you can take the bars off over your head so you don’t have to commit to wearing the harness between sets if you’re doing a long gig.  If you’re wearing it on top of your normal clothes then you can see it running down your back, but it’s not at all unsightly and I’d be happy to wear it like this.

Summary

Overall I like the FREEneck harness.  It’s a lot more comfortable than the strap I had before and my back hasn’t been hurting since I’ve been using it.  I’d be quite happy to play for extended periods on this, even if I had to be standing.  The major downsides of the harness are the cost and the practicality of carrying the thing around, but it could be a good solution for a lot of players.

Posted in Music


Fast subset selection by row name in R

December 5th, 2013

Introduction

One of the best features about R is the simple way you can use a number of different strategies to create subsets of large data tables.  The basic selection mechanisms you have are that you can subset a data frame by providing:

  1. A set of column or row indices
  2. A set of boolean values to specify whether individual columns or rows should be included
  3. A set of row or column names to include in the filtered set

Mixing these basic mechanisms with the plethora of other functions R provides to generate these data types gives you a simple way to create powerful filters to reshape your data.
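
As a quick illustration of the three mechanisms, here is a minimal sketch using a small made-up data frame:

# A small example data frame with named rows
df <- data.frame(a=1:5, b=letters[1:5], row.names=paste("row",1:5,sep=""))

df[c(1,3),]              # 1. by row indices
df[df$a > 2,]            # 2. by a vector of boolean values
df[c("row2","row4"),]    # 3. by row names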

The Problem

Anyone who has used R for a significant amount of time will be painfully aware that although its syntax can be very powerful, the implementation of some of its functions is often not very efficient.  I had always assumed that the core selection and filtering tools would be very efficient though.  It turns out I was wrong.

Specifically I came across a case where I wanted to make a sub-selection of rows from a very large data frame (~500k rows) based on the row names.  I therefore tried what seemed to be the obvious solution, which was to do a row name based lookup.  It turns out that this is very slow.  You can see the problem using the simple example below, which selects 2,500 random rows from a 1 million row data frame.  On my machine the last line takes around 20 seconds to complete, and the completion time seems to scale linearly with the number of random rows selected.

# Build a data frame with a million rows of random data
my.data <- data.frame(values1=rnorm(1000000),
values2=rnorm(1000000),values3=rnorm(1000000))

# Give every row a unique name
rownames(my.data) <- paste("row",1:1000000,sep="")

# Pick 2500 row names at random
sample(row.names(my.data),2500) -> random.rows

# Subset by row name - this is the slow step (~20 seconds)
my.data[random.rows,] -> filtered.data

The Solution

The limiting step in the slow code above seems to be the actual lookup of the list of names against the rownames in the data frame.  Lookups by index position do not appear to be slow, so you can work around this by doing the lookups yourself manually before running the filter. You can easily do this with the match() function which will take in a vector of strings and will give you the index position of each one in a second vector.

This means we can change the filtering part of the original script to be:

my.data[match(random.rows,rownames(my.data)),] -> quick.filtered.data

..which completes in less than a second.  In this case the lookup in the data frame is done based on the indices returned by match, so match, and the lookup it performs, appear to be implemented much more efficiently than the core data frame filtering code.
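
If you want to see the difference on your own machine you can wrap the two approaches in system.time() – a minimal sketch reusing the objects created above:

system.time(my.data[random.rows,])                          # row name based lookup (slow)
system.time(my.data[match(random.rows,rownames(my.data)),]) # match() based lookup (fast)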

There may be some added functionality in the core data frame lookup code of which I’m not aware, but it seems odd (and bad) that the natural way to do this kind of lookup is so inefficient.  I’d have expected the core lookup to be doing what match does behind the scenes, and to be able to return a result just as efficiently as our modified solution.  Hopefully this can be addressed in future, but at least for now there is a fairly simple workaround which can make things much quicker.

Posted in Bioinformatics, Computing


Why is java not found on the command line after I’ve installed it?

September 9th, 2013

The Problem

A common problem reported to us by users of SeqMonk is that when they try to launch the program on a Windows system they get an error message saying that java could not be found, even though they have installed it and can show that it works fine in their browser.

When SeqMonk looks for java it’s actually testing whether the command shell has been configured such that running “java -version” returns a valid java version identifier.  If you run this on a system with java correctly installed you should get something like this:

C:\Users\andrewss>java -version
java version "1.7.0_13"
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

However on a machine where java has not been correctly installed to the command line you’ll get something like this:

C:\Users\andrewss>java -version
'java' is not recognized as an internal or external command,
operable program or batch file.

The Reason

If you’ve installed java and the installation completed successfully with no errors, but you still can’t access the java program on the command line then the most likely cause is that there is a mismatch between the version of java you have and your operating system.  Most PCs these days ship with a 64-bit operating system, and this is able to run both 64-bit and 32-bit programs.  Java also comes in 64-bit and 32-bit versions, and both of these can run on 64-bit windows (32-bit windows can only run 32-bit java).

If you are using java to run local programs then you generally want to run the 64-bit version of java since it gives you access to more of your system’s memory and includes other efficiencies.  For historical reasons though Oracle don’t link to the 64-bit version of java on their default download page, so it’s very easy to be led into installing the 32-bit version.  For some applications this won’t matter, but a secondary consequence is that installing the 32-bit version of java on a 64-bit windows system will not put the java program onto the command line, so although it will work in a browser it won’t work in your command shell and programs such as SeqMonk won’t find it.
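
If you want to check what is happening on your own machine, open a command prompt and run the two commands below (a quick sketch – your exact output will obviously differ):

C:\Users\andrewss>where java
C:\Users\andrewss>java -version

If “where java” can’t find anything then nothing has put java onto your PATH, and if “java -version” does work then a 64-bit installation will normally include “64-Bit Server VM” somewhere in its output.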

The Solution

The quick fix for this problem is to install the 64-bit version of java.  You can install this alongside the 32-bit version and the two can coexist happily, so anything which does need the 32-bit version (some browsers still might) can still use it, but the 64-bit version will then work on the command line and you will be able to access all of your system’s memory in the java programs you run.

The path you need to take to get this to work is shown below:

First, go to java.com and click on the link which says “Free Java Download”.

On the next screen ignore the link which says “Agree and Start Free Download”; instead look for, and click, the smaller link below it which says “See all Java downloads”.

From the list on the next page find and click the option which says “Windows Offline (64-bit)”.  This will download the correct version of java which, once installed, should allow you to run command line java programs correctly.

Posted in Computing


A new way to look at duplication in FastQC v0.11

September 3rd, 2013

Introduction

After a long gestation we’ll be releasing a new version of FastQC in the near future to address some of the common problems and confusions we’ve encountered in the current version.  I’ll write more about this in future posts but wanted to start with the most common complaint, that the duplicate sequence plot was difficult to understand and didn’t really show the information people wanted to see.  I wrote a post previously to explain how the current module works but in the new release this module will be significantly restructured to give a more comprehensive, relevant and hopefully intuitive view of the duplication within a library.

What was wrong with the old version?

A few things actually…

  1. To keep memory usage sane we had to sample the library rather than count all sequences.  This meant that what we showed in the plot was the relative number of sequences within the sample set at different duplication levels.  Because there wasn’t enough space to show all possible duplication levels we had to set a limit at 10, but this meant that you couldn’t accurately judge the level of duplication just by looking at the graph, because you couldn’t tell whether the sequences in the 10+ field occurred 11 or 11,000 times.
  2. Because we hadn’t found a nice way to extrapolate from the subset of data we monitored to the whole dataset, the y axis for the plot expressed duplication levels relative to unique sequences.  This was easy to do but was difficult to interpret, and was pretty misleading in cases where there were virtually no unique sequences in the library.
  3. We included an overall duplication value at the top of the plot, shown as a percentage, but again this was difficult to interpret since the value was simply the percentage of sequences in the library which were not unique.  What this meant was that a library containing the same sequence in every read, and a library with diverse sequences which were all present exactly twice, would both have a duplication level of 100%, which wasn’t very good.

What have we done to fix this?

The new version of the plot still retains some of the properties (and limitations) of the old method.  Basically, the way in which the initial count data is collected hasn’t changed.  We still sample the library so that we don’t require huge amounts of memory, and we still only look at the first 50bp of any library over 75bp so we don’t get messed up by base calling error.  It’s the way we treat and represent this data which has changed.

The inspiration for the new version of the module mainly revolves around an improved way of extrapolating from our limited dataset to the rest of the library so that we can accurately predict what we would have seen if we had actually measured the whole library.  The maths for this was kindly worked out and written up for us by Brian Howard of sciome.com, to whom we’re extremely grateful, and we’ve adapted his approach in the new version.

The new plot has three major changes to the old one.

  1. The number of x-axis categories has been increased to expand the 10+ category into several groups to give you a clearer picture of the true level of duplication in your sample.
  2. The y-axis of the plot now represents a percentage of the total library, so all values are directly comparable and much easier to understand.
  3. The plot now contains two lines instead of one.  Both of these profiles show the proportion of the library which comes from sequences with different levels of duplication.  The two traces simply show the percentages of the raw and deduplicated libraries.  By looking at the two traces you can quickly see how the library is divided between different duplication levels, and by comparing the two traces you can easily see whether the 10+ category is composed of a small number of very highly duplicated sequences, or a much larger number of moderately duplicated sequences.

In addition to the changes to the plots we have also changed the headline duplication level value.  This is still shown as a percentage, but the value now shows what percentage of the library would remain if you deduplicated it to keep only one copy of every different sequence, which fits a lot better with how people actually think about duplication.
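
As a rough illustration of how the new headline value works (a minimal sketch in R with made-up numbers, not the actual FastQC code), given a count of how many distinct sequences were seen at each duplication level you can calculate the percentage of the library remaining after deduplication, and the two traces, like this:

# Made-up data: number of distinct sequences seen at each duplication level
dup.level <- c(1,2,3,10,1000)
distinct.count <- c(80000,10000,5000,200,10)

total.reads <- sum(dup.level*distinct.count)   # size of the raw library
distinct.reads <- sum(distinct.count)          # size after deduplication

headline.percent <- 100*distinct.reads/total.reads         # headline duplication value
raw.trace <- 100*(dup.level*distinct.count)/total.reads    # raw library trace
dedup.trace <- 100*distinct.count/distinct.reads           # deduplicated library trace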

So is it now perfect?

Yes.

OK, no.  But it’s much better.  There are however some problems which aren’t solved (and may not be solvable).

  1.  The sampling of the library is still based on the first N different sequences we see (where N is currently 200,000, but in the new scheme it doesn’t really matter too much what it is).  This approach relies on seeing the sequences in a random order, which is fine for fastq files, but you can still get misleading answers if you feed a sorted BAM file into the program.  This isn’t really something we can solve, since the only guaranteed solution involves holding every possible sequence in memory, which is bad.
  2. When setting warn / error limits we have to make assumptions about the library which aren’t really true.  We assume that we’re sampling from an infinitely sized genome, and that no duplication is always better.  In reality you might be sampling very deeply from a small genome where you’d expect a certain level of duplication, and where lower duplication levels might be just as worrying as high levels.  Alternatively you might be sampling in a way where even representation isn’t expected (such as RNA-seq libraries where representation reflects expression levels).  In theory we could address the first case if we were given a genome size, but the second case is very difficult unless we make some big assumptions about the expected distribution of expression levels.  This might be more of a problem if library sizes got much bigger, but it’s not too bad at the moment.
  3. All of the duplication detection is based on an exact match between two sequences and even a single base change is enough to call the sequences as different.  In libraries of poor quality, which contain a larger percentage of sequencing errors, the duplication level will be under-reported as the sequencing errors will be incorrectly interpreted as diversity in the library.  This might be theoretically solvable with fuzzy matching, or a quality aware algorithm, but both of these would be significantly more computationally intensive than the current approach and would slow down the whole program – something we’re trying very hard not to do.

Can I see an example?

Of course.  Let’s start with a nice library and work our way down to something nasty.

Example 1 – Very diverse library

Here we have a library with low duplication.  In both the raw and deduplicated versions of the library the vast majority of reads come from sequences which only occur once within the library.  This suggests that the library sampling isn’t anywhere close to saturation and that you have a diverse population.

Example 2 – RNA-Seq

This is a good RNA-Seq library, but actually represents the kind of profile you might see in any enriched experiment.  In general FastQC makes the assumption that you expect to be looking at a diverse library with a roughly even representation of all sequences.  In enriched libraries this isn’t going to be true and there are certain sequences (highly expressed genes or enriched peaks in a ChIP) where you want the sequence to occur more frequently.  Above a certain read density the only way to place more reads into a region is via duplication so we might not expect to have a properly diverse library.

What we can see in this case is that we have lots of sequences with low duplication levels (2-9 copies) which probably cover most of the ‘normal’ genes we’ve measured in the library.  Then there are a bunch of sequences with moderate duplication (10 – 100 copies) which may cover things like rRNA and some repeat sequences.  There are very few highly duplicated sequences which might be more indicative of a technical problem with the library.

FastQC will probably throw a warning or error for these libraries, since fairly large chunks of the library would be lost were it to be deduplicated, but the profile seen here actually isn’t too bad for RNA-Seq.  If anyone has a suggestion for a better generic metric than the amount of loss upon deduplication then we’d be happy to hear it.

Example 3 – PhiX

In a PhiX control lane you expect that all of the PhiX sequences will be present thousands of times, so we don’t expect this to look pretty.  From the blue line we can see that in the raw library the majority of the library is composed of sequences which are present 1000 – 5000 times, with very few sequences from the other categories.

Interestingly we can also see that when we deduplicate the library, the vast majority of sequences in the deduplicated set would come from reads which were present only once in the original library, suggesting that the deduplicated set would be dominated by sequencing errors or contaminants.  This is maybe a good indication of why deduplicating your data as a matter of course is a really bad idea.

Example 4 – Horrible PCR duplication

This was about the worst example I could find – an RNA-Seq library which had some nasty contamination and PCR duplication.  You can see that in the raw data the majority of reads come from sequences which are present tens of thousands of times in the library, but that there is also strong representation from reads which are present hundreds or thousands of times, suggesting that the duplication is widespread.

In this case deduplicating may help the experiment, but it causes the loss of 93% of the original data so it’s a pretty drastic step to take.

Posted in Bioinformatics


Generating R reports with vector images from markdown with knitr

June 21st, 2013

Introduction

One really nice addition to a standard R environment is the ability to create reports which combine R code, comments and embedded graphical output.  The original mechanism for doing this was Sweave, but more recently a second system called knitr has emerged which seems to be more flexible, and this is what I’ve been using.

Basic Knitr Use

One of the aspects I particularly liked about knitr was that you could use simple markdown syntax to create your reports rather than having to use Latex, which I’ve never really got to grips with.  It’s really simple to create documents which mix comments, code and graphics.  Here is a simple example:

Test
=====

Simple test markdown for use with knitr

```{r}
plot(1:20,(1:20)^2)
```

If you use the standard “Knit HTML” function built into Rstudio then this will generate an HTML document with bitmap PNG images actually embedded within the document source as base64 encoded data.  This is fine, but what I often wanted was the ability to generate vector images so that they could later be extracted for publication without having to rerun the script.

Having done some searches I found lots of different bits of information which described different ways to convert the various files which knitr can produce, but with wildly varying levels of success, so having figured out some nice ways to make this work I thought I’d document them.

Using Vector Images directly in your HTML

The simplest way to get vector images into your document is simply to change the type of image you use to a vector based image.  Most web browsers these days can understand SVG, and this is a standard output option for R so you can simply change your code to:

Test
=====

Simple test markdown for use with knitr

```{r dev='svg'}
plot(1:20,(1:20)^2)
```

..and you’ll now get an HTML document with embedded SVG graphics.

I did however find a problem with this.  Some of the graphs I was drawing contained a reasonably large number of points, and the SVG files created were correspondingly large, which made them unwieldy and slow to open.  This is, however, the simplest way to do the conversion.

Using PDF as an output format

An alternative would be to generate a PDF document as your output format.  R’s PDF output seems to be better structured than its SVG output, and plots which were very large as SVG files were tiny in their PDF equivalents.

Knitr itself won’t natively create PDFs and the easiest way I found to do the conversion was to use a program called pandoc.  Pandoc can read in files in a variety of formats and convert them into PDFs using a command as simple as:

pandoc -s input.html -o output.pdf

I found this suggestion in a few places for converting the HTML output from knitr into PDF, but there are a couple of problems with this approach.

  1. The default HTML output from Rstudio uses HTML encoding and pandoc can’t understand this.  In order to do the conversion you need to tell the program to create xhtml encoded output.  You can do this by manually running the conversion in an R session:
    library(knitr)
    library(markdown)

    knit("knitr_test.Rmd","knitr_test.md")

    markdownToHTML("knitr_test.md","knitr_test.html",options=c("use_xhtml"))

    which will create an xhtml version of your Rmd file which can be converted.

  2. The second problem is that pandoc will embed whichever figures you created in your initial knitr run and try to put these into the PDF.  This is fine if you used PNG files, but it can’t handle SVG files so you’ll get an error if you try to use these.  The solution to this is either:
    1. Change your device type in your file to
      {r dev='pdf'}

      This will produce an html file you can’t actually open in a browser as the PDF links won’t be rendered, but pandoc will recognise them and make a nice PDF from them.

    2. Change your device type to PDF and run knitr, but then don’t generate HTML – instead run pandoc on the .md file created by knitr (see the sketch below).  This will actually produce a more nicely formatted PDF than going from the HTML.  One problem you will find with this though is that you get odd figure legends under your plots (“Figure 1: plot of chunk unnamed-chunk-1” for example).  You can fix this by putting a proper figure caption on the R code chunk which generates the plot:
      ```{r dev='pdf', fig.cap="My first figure"}
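
Putting option 2 together, the whole route from the Rmd file to a finished PDF (reusing the knitr_test.Rmd example from above, and assuming pandoc can find a LaTeX installation, which it uses behind the scenes for the final PDF conversion) looks something like this:

Rscript -e 'library(knitr); knit("knitr_test.Rmd")'
pandoc -s knitr_test.md -o knitr_test.pdf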

Using these relatively simple steps you should be able to turn an R script with a minimal amount of additional formatting into a professional looking document with publication quality figures embedded into it.

Posted in Bioinformatics, Computing


Syncing calendars between iOS and android

October 4th, 2012

Introduction

Having been happily using a combination of OSX and iOS for a few years now I’ve recently expanded my device collection with a google nexus 7 tablet.  I’ve been using a separate IMAP based email account for a while so this was easy to set up on the new device, but I also wanted my calendars to work.

Up until this point I’d been using the iTunes based syncing to manage my calendars between my local copy of iCal and the calendars on my iPod.  This was somewhat inconvenient since the calendars would occasionally be out of sync, but for the most part it worked well.

Now that I had an android device I wanted to be able to both view my events on my computer, iPod and tablet, and be able to create new events on any of these devices and have them sync to the others.

I read a few different howtos describing how to achieve this, but since no single guide got me all the way there I thought it would be worth recording how I made it work in the end.

Moving data to google calendar

Since all of the iCal events I had were stored in a local database on my mac, and since I didn’t intend physically connecting my tablet to my computer to sync I realised that I’d need to move my events into google calendar and use that as the central repository.  In my existing iCal I’d set up a few different calendars for work / home / other event types, but although google calendar can be made to support this, it’s really designed to have a single calendar within which you can customise the colour with which different events are shown.  I therefore decided to merge my various calendars into a single google calendar.

It’s possible to connect a google calendar to iCal using the iCal account preferences and adding in your google credentials, however this enables sync in only one direction (events are pulled down from google) – none of your existing events are synchronised up to google.  You can send individual events to your google calendar by dragging them onto the google entry in the calendar list, but this won’t work for large numbers of events.

The easiest way to transfer all of the events was to export each of my calendars as an ics file by selecting them in turn in the iCal calendar list and then selecting File > Export > Export.  Once you have the collection of ics files you can go to the google calendar web page.  Under the “Other calendars” drop down there is an option to import events from an iCal file.  If you do this for each of your files then all of your existing events will be copied into google calendar.

Adding new events via iCal

Once all of your existing iCal events have been imported to google calendar you can then delete the local calendars from your iCal preferences. You can add your google calendar by going to iCal > Preferences > Accounts > Pressing + to add an account > Select ‘Google’ for the account type and then entering your google credentials.

To ensure that new events go into this calendar go to the “General” section of your preferences and select this calendar as your Default Calendar.

Syncing calendars to an IOS device

My first attempt at putting the google calendar information on my iPod was to use the iTunes sync options and selecting the google calendar I’d added to iCal.  This did indeed cause all of the events to show up in my IOS iCal, but new events could only be added to the local calendar and wouldn’t therefore have passed through to my other devices.

The correct way to do this was to add my Gmail account as an account in the general IOS settings.  Under Settings > Mail, Contacts, Calendars I chose “Add account” and then selected Gmail – some guides (including ones from google) say to use Exchange or CalDav, but only the Gmail option worked smoothly.  You simply need to enter your gmail credentials, and then once these are verified you can turn off email sync if you only want to import calendar events.  Once this was done the calendar showed all of my events, and new events which were added synchronised over to my computer and tablet the next time I was attached to a network.

I now therefore have the same calendar available on my computer, tablet and iPod with the ability to make changes on any device.  As a bonus I can also choose to share the calendar with the rest of my family so they can see my events.

Posted in Computing, Technology


Should you buy a nanopore sequencer?

February 18th, 2012

This morning twitter is awash with posts discussing the newly announced nanopore sequencers from Oxford Nanopore. Speculation has been rife for some time about the potential specifications of the first sequencers to be produced by the company, and it certainly appears that the company have fulfilled the expectations placed upon them.

I’m not going to go through the details of the two sequencers announced – others have done a good job of listing the specs here and here, but basically you have machines with either 512, or 2000 nanopores capable of sequencing fragments up to 40kb (but probably several kb routinely) and error rates of around 4%, mostly indels, and with promises of imminent improvements to bring this value down – all at a per-base cost similar to the best of the existing platforms.

Reading through the initial comments about this new platform my first reaction was that we have to get one (or more) of these, but after calming down and thinking about this for a bit I thought I’d have a stab at going through the use cases where this type of sequencer really makes sense.

De-novo sequence assembly: Oh yes!

The one place where this new platform is a complete no-brainer is if you’re assembling de-novo genome sequence. Whilst Illumina sequencers can give you good coverage depth from paired end reads of around 100bp, there are always regions of the genome whose repetitive nature means that this will not provide enough context to allow a contiguous assembly. Currently you either need to start creating mate-pair libraries, which are notoriously difficult to produce, or you need to get your floor reinforced and stump up a huge amount of cash for a PacBio.  The prospect of generating reads of 10kb+ with a simple library prep should be music to your ears, and a 4% error rate with short indels should be easy to work around with a mixed assembly.

Metagenomics: Oh yes!

In the same vein as de-novo assembly, the prospect of longer reads should make metagenomic studies much easier.  Getting more context for your reads should allow you to distinguish between related species much more easily, and assembly of mixed bacterial populations should be possible even with the slightly more limited throughput of these sequencers.

Genotyping: Possibly

I guess the main advantage of the nanopore platform for genotyping is the speed with which it can generate data.  Data collection begins almost immediately upon addition of the sample, and real-time monitoring of the data output means that you can stop the run as soon as you have observed all of the variants you are looking for.  The long read lengths should allow the elucidation of even the most complex genome re-arrangements.  The somewhat high error rates may be problematic, but if these really are mostly indels then SNP calling might still be practical.  The per-base cost means that the current sequencers aren’t yet practical for real time diagnostic use, but future developments on this platform would seem to make this a possibility.

Epigenetics: Possibly

One of the promises of nanopore sequencing was the ability to distinguish modified bases during the base calling process.  PacBio have shown that they are able to distinguish hydroxymethyl-cytosine from cytosine, and suggest that identification of methyl-cytosine is theoretically possible. In the reports I’ve seen so far Oxford Nanopore haven’t said anything concrete about the ability of their platform to call modified bases, but if this proves to be possible and reliable then this will become an essential bit of kit for labs working on epigenetics.  The ability not only to read modifications directly, but to put them in the context of a multi-kb fragment, is truly exciting.  The addition of a hairpin structure at the end of a fragment would also allow these sequencers to read both strands of the same fragment, again providing contextual information which has so far been lacking.

It’s possible that the nanopore sequencers may still be of use to epigenetics even without the ability to read modifications directly.  Genome wide bisulphite sequencing is already being undertaken on Illumina sequencers, and should be possible on nanopore sequencers, however the bisulphite treatment itself is very harsh and fragments the DNA sample as it modifies it, so the super-long reads which can be obtained from normal genomic DNA may prove elusive once it has been treated.

ChIP-Seq: Not really

The power of ChIP-Seq comes from the number of observations you make, not the length of those observations.  The nanopore sequencers seem to be best suited to sequencing fewer, longer fragments, which would not be an advantage for ChIP-Seq.  There seems no obvious reason why short insert libraries couldn’t be sequenced on a nanopore platform, but at the moment we know very little about the overhead of starting a new sequence on the same nanopore, and in any case longer read lengths would simply reduce the resolution of the ChIP assay.  For some applications it might be interesting to monitor ChIP results in real time, and be able to halt a run once clear peaks had emerged, but in the short term I can’t see this being a good option for this type of experiment.

RNA-Seq: It depends

As with ChIP-Seq, much of the power of RNA-Seq comes from the number of observations which have been made.  To make a reasonable measurement of low-expressed transcripts very large numbers of sequences must be generated, and the existing short read platforms will likely have an advantage in this regard for some time, so for simple quantitation of transcripts the nanopore platform may not offer huge advantages.  Where the longer read lengths of the nanopore sequencers will be of use is in the elucidation and quantitation of splice variants.  Current RNA-Seq protocols provide coverage of a very small part of the transcript and often do not provide enough context to determine exactly which splice variant the reads came from, so performing relative quantitation of the splice variants of a gene is not a simple process.  Longer reads from a nanopore sequencer could cover the whole length of a transcript, removing all doubt about exactly which variant it was.  Whether this proves to be a useful tool for expression quantitation will depend on whether the platform is able to generate an unbiased selection of reads (or a selection with a well understood bias) to allow accurate quantitation.

So…

So do I think we should get one of these sequencers?  Heck yes! At $900 apiece for the MinION there’s absolutely no excuse for everyone not to get one and start playing with it to see what it can do. For much of the workload we currently have it may be that this platform isn’t going to revolutionise what we do, but if nothing else it will hopefully spur the existing manufacturers on to push forward the development of their platforms.  In any case the scientists win.  We live in exciting times…

Posted in Bioinformatics, Technology


Review of Hanson SB5 Baritone Saxophone

February 12th, 2012

Background

Hanson music wasn’t a name I’d seen before I saw an advert for one of their baritone saxes on Ebay. I’d normally be wary of buying an unknown brand of sax over the internet, but there was one thing which piqued my interest.  Most companies have a selection of favourable customer comments on their site, but Hanson simply have a note saying “Google us and see what others are saying”. This shows some confidence in your online reputation, and Hanson were right to make this claim. Pretty much every comment I found about them was positive, and many people were strongly recommending them as an excellent source for affordable saxophones.

There seem to have been a number of companies appearing over the last few years who are changing the perception of the cheaper end of the saxophone market. This used to be associated with flimsy instruments with poor build quality and intonation, but there are now several companies making very playable saxes at very competitive prices.  I’ve written before about Antigua Winds, whose instruments I have used and who produce a wide range of models.  Hanson music though seem to be following more in the line of companies such as Kessler Music, who have made a name for themselves in the States by producing their own line of saxes and providing excellent advice and customer service.  Having something similar in the UK sounded like a very attractive prospect.

Buying the Sax

The sax on offer was an ex-demonstrator model of their mid-range SB5 baritone. I was looking for a baritone mostly to play in big bands and wind orchestras.  I wasn’t prepared to spend the £4000+ it would cost for a new Yamaha or Yanagisawa so had been looking for a reasonable second hand offer.  As a mid-range sax the SB5 was about the level I wanted.  At £2200 for a new one it’s slightly more than the Antigua Winds 5595LQ, but with the discount for this instrument being an ex-demonstrator I decided to take the plunge.

The sax I bought was available either through Ebay or from Hanson’s own web site.  I actually bought the sax from the Hanson site directly. The site was reasonably informative about the features of the various models, but I was disappointed that there were so few images of the instruments available.  Looking just now there are no images at all of any of the baritone saxes which is ridiculous.  If I’m going to spend over a thousand pounds then at least take some decent pictures of what you’re selling.  In my case there were some photos on the Ebay auction so I could see what I was getting, but even these were quite pale images on a white background and didn’t really show the sax off to best effect. Actually purchasing through their site was mostly OK apart from the site not recognising our postcode (Cambridge postcodes all changed about 3 years ago so someone needs to update their postcode database!).

After placing the order I immediately received an email receipt.  I emailed them with a minor query and had a reply almost immediately, which seemed to match with the good experiences of everyone else I’d read about online.  A couple of hours later I got a tracking code from the courier service who were delivering the sax.  The next day I got a text to give me a 2 hour delivery slot when it would arrive and it was there right on the dot of the start of the slot.  The sax was well packed into a cardboard box surrounded by inflatable packing.

Initial Impressions

As an ex-demonstrator my sax arrived without a mouthpiece or strap so I had a couple of days before I could actually play it, but I gave it a good check over.  I wasn’t expecting the sax to be completely flawless, but I was pleasantly surprised at its condition.  There were a couple of minor dings on the neck and the bottom of the bell, but nothing which caused me any concern.  The sax itself seemed to be very solidly constructed and the action on the main keys was nicely weighted.  The finish on the lacquer was much nicer in real life than it appeared in the photos on the web with a colour which was much richer and deeper than I was expecting, which was a very pleasant surprise.  However, every sax I’ve ever bought had some issues and this one was no exception.  My sax arrived with a slightly sticky octave mechanism which would have caused problems with fast passages in the upper register, but a bit of lubrication quickly fixed this. The low A pad also wasn’t closing completely, but a minor adjustment to the limiter soon fixed this.

Apart from the sax itself, the case it came in is worthy of comment. It goes without saying that a baritone is not the easiest thing to move around, so having a good case is a real bonus. The case supplied with the SB5 was excellent. It’s a hard case with a nicely textured plastic exterior.  The handles on the top and side are very comfortable, and it has built-in wheels which are going to be a real lifesaver.  The case is secured by 4 very solid catches, two of which are lockable.  Having seen the cases which come with high end Yanagisawas I can confidently say that I’d much rather have the case which came with the SB5.

Playing

Since my sax didn’t come with a mouthpiece I had to buy one to go with it, but I’d have done this anyway.  I got a good deal on an Otto Link 7, and paired this with some Vandoren 3 reeds (at £6 apiece! – and which turned out to be way too hard).  Getting the Link mouthpiece onto the SB5 was a bit of a job: getting the tuning right meant putting the mouthpiece on right to the end of the cork, which is a very tight fit, even with well greased cork.  I’m still getting to grips with the setup, but having played my first gig on the instrument I feel confident enough to make some comments about the way it plays.

Generally I’ve been very happy with the playing characteristics of the SB5.  The intonation across the whole range is excellent and I was quickly able to produce a sound I was happy with.  The notes at the top end of the lower register seem to be slightly weaker than the rest, but are still OK, and may improve as I play with the setup.  The upper register is a delight to play, with a strong tone being easy to produce right up to top F#.  The only problems I’ve had have been right at the bottom end, where the low notes activated by the left hand keys (C#, B and Bb) have occasionally proved difficult to hit and maintain (low C and bottom A are consistently good).  I suspect the reason is that I’m not pressing the keys in this group cleanly and am catching surrounding keys.  The springing on some of these low keys is quite weak, which may exacerbate this problem, but it’s probably something which will improve with more familiarity, or if not could be fixed with stronger springs.

Summary

Overall I am very pleased with the SB5. It has proven to be a solid and competent instrument. Realistically it’s not the equal of something like a Yanagisawa, but it’s been a pleasure to play on and will serve my needs very well.  You’re getting an awful lot of instrument for your money, and everything I’ve learned of the company would make me happy to recommend them to others, or to use them again myself.

Posted in Music


The true cost of object creation in java

December 6th, 2011

I’ve been spending some time trying to optimise the data loading part of one of my java projects.  The nature of the data we use means that we have to create hundreds of millions of objects, each of which internally stores only a single long value (it actually stores several fields packed into this value using a bitmask since this is more memory efficient).
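
For context, an object of that kind looks something like the sketch below.  The field names and bit widths here are made up for illustration – they’re not the layout used in the real project – but the idea of packing and unpacking several values through shifts and bitmasks is the same.

public class PackedValue {

    // Illustrative layout: bits 0-31 = position, bits 32-62 = length, bit 63 = strand
    private final long value;

    public PackedValue (long position, long length, boolean reverseStrand) {
        value = (position & 0xFFFFFFFFL) | ((length & 0x7FFFFFFFL) << 32) | (reverseStrand ? 1L<<63 : 0L);
    }

    public long position () { return value & 0xFFFFFFFFL; }

    public long length () { return (value >>> 32) & 0x7FFFFFFFL; }

    public boolean reverseStrand () { return (value >>> 63) != 0; }
}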

When loading our data we are therefore parsing hundreds of millions of long values and creating the associated objects.  This can take a few minutes to complete, and having profiled the code it seems that it is the object creation which slows everything down.  I therefore did some tests to work out exactly how slow the creation of objects is relative to the primitives which exist in java.  My test code is below:

public class CreateTest {

    public static void main (String [] args) {
        long start = System.currentTimeMillis();

        long [] primitives = new long[50000000];
        for (int i=0;i<primitives.length;i++) {
            primitives[i] = i;
        }

        long end = System.currentTimeMillis();

        System.out.println("Making 50 million longs took "+(end-start)+"ms");

        start = System.currentTimeMillis();

        Long [] objects = new Long[50000000];
        for (int i=0;i<primitives.length;i++) {
            objects[i] = new Long(i);
        }

        end = System.currentTimeMillis();

        System.out.println("Making 50 million Longs took "+(end-start)+"ms");
    } 
}

It’s reasonable to think that there will be an overhead for object creation, but I was surprised by the results:

Making 50 million longs took 199ms
Making 50 million Longs took 10809ms

So that’s a 50-fold overhead for the object wrappers around these numbers.  What’s worse is that this overhead seems to happen inside the JVM in such a way that you can’t take advantage of multi-threading to get around it.  I tried refactoring the code to have 5 threads creating 10 million objects each, and the total runtime across 5 cores was pretty much exactly the same as doing the same thing on a single core.  This means that if you want to have 50 million objects available in your program then you’re just going to have to wait 10 seconds for them, however many cores you want to throw at the problem.

I also investigated other options for object creation. Namely I made my object cloneable and then used clone() to create new instances rather than calling the constructor.  The constructors for my object are very lightweight, so it was disappointing, but not surprising to see that this had no appreciable effect on the time taken for object creation.

I’ve even toyed with the idea of just storing these objects as an array of longs and avoiding this overhead all together.  I could still extract the relevant data by using a set of static methods, but what I can’t then do is to sort these objects (which I need to) since there’s no way to do a custom sort in java without putting the data into objects (which would defeat the point).

I’m therefore stuck with the biggest bottleneck in my program being something which I know could be improved 50-fold (and would then make everything hugely quick), but not within the confines of the java language.

Posted in Computing