<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Proteo.me.uk</title>
	<atom:link href="http://proteo.me.uk/feed/" rel="self" type="application/rss+xml" />
	<link>http://proteo.me.uk</link>
	<description>Biology, Tech, Computing, whatever...</description>
	<lastBuildDate>Sat, 18 Feb 2012 13:02:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Should you buy a nanopore sequencer?</title>
		<link>http://proteo.me.uk/2012/02/should-you-buy-a-nanopore-sequencer/</link>
		<comments>http://proteo.me.uk/2012/02/should-you-buy-a-nanopore-sequencer/#comments</comments>
		<pubDate>Sat, 18 Feb 2012 13:02:35 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[hardware]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=204</guid>
		<description><![CDATA[This morning twitter is awash with posts discussing the newly announced nanopore sequencers from Oxford Nanopore. Speculation has been rife for some time about the potential specifications of the first sequencers to be produced by the company, and it certainly appears that the company have fulfilled the expectations placed upon them. I&#8217;m not going to [...]]]></description>
			<content:encoded><![CDATA[<p>This morning twitter is awash with posts discussing the newly announced nanopore sequencers from Oxford Nanopore. Speculation has been rife for some time about the potential specifications of the first sequencers to be produced by the company, and it certainly appears that the company have fulfilled the expectations placed upon them.</p>
<p>I&#8217;m not going to go through the details of the two sequencers announced &#8211; others have done a good job of listing the specs <a href="http://omicsomics.blogspot.com/2012/02/oxford-nanopore-doesnt-disappoint.html">here</a> and <a href="http://pathogenomics.bham.ac.uk/blog/2012/02/oxford-nanopore-megaton-announcement-why-do-you-need-a-machine-exclusive-interview-for-this-blog/">here</a>, but basically you have machines with either 512, or 2000 nanopores capable of sequencing fragments up to 40kb (but probably several kb routinely) and error rates of around 4%, mostly indels, and with promises of imminent improvements to bring this value down &#8211; all at a per-base cost similar to the best of the existing platforms.</p>
<p>Reading through the initial comments about this new platform my first reaction was that we have to get one (or more) of these, but after calming down and thinking about this for a bit I thought I&#8217;d have a stab at going through the use cases where this type of sequencer really makes sense.</p>
<h2>De-novo sequence assembly: Oh yes!</h2>
<p>The one place where this new platform is a complete no-brainer is if you&#8217;re assembling <em>de-novo</em> genome sequence. Whilst Illumina sequencers can give you good coverage depth from paired end reads of around 100bp there are always regions of the genome whose repetitive nature means that this will not provide enough context to allow a contiguous assembly. Currently you either need to start creating mate-pair libraries, which are notoriously difficult to produce, or you need to get your floor reinforced and stump up a huge amount of cash for a PacBio.  The prospect of generating reads of 10kb+ with a simple library prep should be music to your ears, and a 4% error rate with short indels should be easy to work around with a mixed assembly.</p>
<h2>Metagenomics: Oh yes!</h2>
<p>In the same vein as <em>de-novo</em> assembly the prospect of longer reads should make metagenomic studies much easier.  Getting more context for your reads should allow you to distinguish between related species much more easily and assembly of mixed bacterial populations should be possible even with the slightly more limited throughput of these sequencers.</p>
<h2>Genotyping: Possibly</h2>
<p>I guess the main advantage of the nanopore platform for genotyping is the speed with which it can generate data.  Data collection begins almost immediately upon addition of the sample, and real-time monitoring of the data output means that you can immediately stop the run once you have observed all of the variants you are looking for.  The long read lengths should allow the elucidation of even the most complex genome re-arrangements.  The somewhat high error rates may be problematic, but if these really are mostly indels, then SNP calling might still be practical.  The per-base cost means that the current sequencers aren&#8217;t yet practical for real time diagnostic use, but future developments on this platform would seem to make this a possibility.</p>
<h2>Epigenetics: Possibly</h2>
<p>One of the promises of nanopore sequencing was the ability to distinguish modified bases during the base calling process.  PacBio have shown that they are able to distinguish hydroxy-methyl-cytosine from cytosine, and suggest that identification of methyl-cytosine is theoretically possible. In the reports I&#8217;ve seen so far Oxford Nanopore haven&#8217;t said anything concrete yet about the ability of their platform to call modified bases, but if this proves to be possible and reliable then this will become an essential bit of kit for labs working on epigenetics.  The ability not only to read modifications directly, but to be able to put these in the context of a multi-kb fragment is truly exciting.  The addition of a hairpin structure at the end of a fragment would also allow these sequencers to read both strands of the same fragment, again providing contextual information which has so far been lacking.</p>
<p>It&#8217;s possible that the nanopore sequencers may still be of use to epigenetics even without the ability to read modifications directly.  Genome wide bisulphite sequencing is already being undertaken on Illumina sequencers, and should be possible on nanopore sequencers, however the bisulphite treatment itself is very harsh, and fragments the DNA sample as it modifies it, so the super-length reads able to be obtained from normal genomic DNA may be elusive once it has been modified.</p>
<h2>ChIP-Seq: Not really</h2>
<p>The power of ChIP-Seq comes from the number of observations you make, not the length of those observations.  The nanopore sequencers seem to be best suited to sequencing fewer, longer fragments which would not be an advantage for ChIP-Seq.  There seems no obvious reason why short insert libraries couldn&#8217;t be sequenced on a nanopore platform, although at the moment we know very little about the overhead of starting a new sequence on the same nanopore, so whether this is feasible remains to be seen.  Longer read lengths would simply reduce the resolution of the ChIP assay.  For some applications it might be interesting to monitor ChIP results in real time, and be able to halt a run once clear peaks had emerged, but in the short term I can&#8217;t see this being a good option for this type of experiment.</p>
<h2>RNA-Seq: It depends</h2>
<p>As with ChIP-Seq, much of the power of RNA-Seq comes from the number of observations which have been made.  To make a reasonable measurement of low-expressed transcripts then very large numbers of sequences must be generated, and the existing short read platforms will likely have an advantage in this regard for some time, so for simple quantitation of transcripts the nanopore platform may not offer huge advantages.  Where the longer read lengths of the nanopore sequencers will  be of use will be in the elucidation and quantitation of splice variants.  Current RNA-Seq protocols provide coverage of a very small part of the transcript and often do not provide enough context to determine exactly which splice variant the reads came from.  Performing relative quantitation of the splice variants of a gene is therefore not a simple process.  Longer reads from a nanopore sequencer could cover the whole length of a transcript removing all doubt about exactly which variant it was.  Whether this proves to be a useful tool for expression quantitation will depend on whether the platform is able to generate an unbiased selection of reads (or a selection with a well understood bias) to allow accurate quantitation.</p>
<h2>So&#8230;</h2>
<p>So do I think we should get one of these sequencers?  Heck yes! At $900 apiece for the MinION there&#8217;s absolutely no excuse for everyone not to get one and start playing with it to see what it can do. For much of the workload we currently have it may be that this platform isn&#8217;t going to revolutionise what we do, but if nothing else it will hopefully spur on the existing manufacturers to push forward the development of their existing platforms.  In any case the scientists win.  We live in exciting times&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2012/02/should-you-buy-a-nanopore-sequencer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Review of Hanson SB5 Baritone Saxophone</title>
		<link>http://proteo.me.uk/2012/02/review-of-hanson-sb5-baritone-saxophone/</link>
		<comments>http://proteo.me.uk/2012/02/review-of-hanson-sb5-baritone-saxophone/#comments</comments>
		<pubDate>Sun, 12 Feb 2012 20:49:32 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Music]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[saxophone]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=186</guid>
		<description><![CDATA[Background Hanson music wasn&#8217;t a name I&#8217;d seen before I saw an advert for one of their baritone saxes on Ebay. I&#8217;d normally be wary of buying an unknown brand of sax over the internet, but there was one thing which piqued my interest.  Most companies have a selection of favourable customer comments on their [...]]]></description>
			<content:encoded><![CDATA[<h2>Background</h2>
<p><a href="http://www.hansonmusic.co.uk/">Hanson music</a> wasn&#8217;t a name I&#8217;d seen before I saw an advert for one of their baritone saxes on Ebay. I&#8217;d normally be wary of buying an unknown brand of sax over the internet, but there was one thing which piqued my interest.  Most companies have a selection of favourable customer comments on their site, but Hanson simply have a note saying &#8220;Google us and see what others are saying&#8221;. This shows some confidence in your online reputation, and Hanson were right to make this claim. Pretty much every comment I found about them was positive, and many people were strongly recommending them as an excellent source for affordable saxophones.</p>
<p>There seems to have been a number of companies appearing over the last few years who are changing the perception of the cheaper end of the saxophone market. This used to be associated with flimsy instruments with poor build quality and intonation, but there are now several companies who are making very playable saxes at very competitive prices.  I&#8217;ve written before about <a href="http://www.antiguawinds.com/">Antigua Winds</a> who I have <a href="http://proteo.me.uk/2011/03/first-impression-of-antigua-winds-ss490lq-soprano-saxophone/">used before</a> and who produce a wide range of instruments.  Hanson music though seem to be following more in the line of companies such as <a href="http://www.kesslermusic.com/">Kessler Music</a>, who have made a name for themselves in the States by producing their own line of saxes, and providing excellent advice and customer service.  Having something similar in the UK sounded like a very attractive prospect.</p>
<h2>Buying the Sax</h2>
<p>The sax on offer was an ex-demonstrator model of their mid-range SB-5 baritone saxes. I was looking for a baritone to mostly play in big bands and wind orchestras.  I wasn&#8217;t prepared to spend the £4000+ it would cost for a new Yamaha or Yanagisawa so had been looking for a reasonable second hand offer.  As a mid-range sax the SB5 would have been about the level I wanted.  At £2200 for a new one it&#8217;s slightly more than the Antigua Winds 5595LQ, but with the discount for this instrument being an ex-demonstrator I decided to take the plunge.</p>
<p>The sax I bought was available either through Ebay or from Hanson&#8217;s own web site.  I actually bought the sax from the Hanson site directly. The site was reasonably informative about the features of the various models, but I was disappointed that there were so few images of the instruments available.  <a href="http://www.hansonmusic.co.uk/category/139/Baritone_saxophones">Looking just now</a> there are no images at all of any of the baritone saxes which is ridiculous.  If I&#8217;m going to spend over a thousand pounds then at least take some decent pictures of what you&#8217;re selling.  In my case there were some photos on the Ebay auction so I could see what I was getting, but even these were quite pale images on a white background and didn&#8217;t really show the sax off to best effect. Actually purchasing through their site was mostly OK apart from the site not recognising our postcode (Cambridge postcodes all changed about 3 years ago so someone needs to update their postcode database!).</p>
<p>After placing the order I immediately received an email receipt.  I emailed them with a minor query and had a reply almost immediately, which seemed to match with the good experiences of everyone else I&#8217;d read about online.  A couple of hours later I got a tracking code from the courier service who were delivering the sax.  The next day I got a text to give me a 2 hour delivery slot when it would arrive and it was there right on the dot of the start of the slot.  The sax was well packed into a cardboard box surrounded by inflatable packing.</p>
<p style="text-align: center;"><a href="http://proteo.me.uk/wp-content/uploads/2012/02/SB5_View.jpg"><img class="aligncenter  wp-image-196" title="Hanson SB5" src="http://proteo.me.uk/wp-content/uploads/2012/02/SB5_View.jpg" alt="" width="480" height="360" /></a></p>
<h2>Initial Impressions</h2>
<p>As an ex-demonstrator my sax arrived without a mouthpiece or strap so I had a couple of days before I could actually play it, but I gave it a good check over.  I wasn&#8217;t expecting the sax to be completely flawless, but I was pleasantly surprised at its condition.  There were a couple of minor dings on the neck and the bottom of the bell, but nothing which caused me any concern.  The sax itself seemed to be very solidly constructed and the action on the main keys was nicely weighted.  The finish on the lacquer was much nicer in real life than it appeared in the photos on the web with a colour which was much richer and deeper than I was expecting, which was a very pleasant surprise.  However, every sax I&#8217;ve ever bought had some issues and this one was no exception.  My sax arrived with a slightly sticky octave mechanism which would have caused problems with fast passages in the upper register, but a bit of lubrication quickly fixed this. The low A pad also wasn&#8217;t closing completely, but a minor adjustment to the limiter soon fixed this.</p>
<p>Apart from the sax itself, the case it came in is worthy of comment. It goes without saying that a baritone is not the easiest thing to move around, so having a good case is a real bonus. The case supplied with the SB5 was excellent. It&#8217;s a hard case with a nicely textured plastic exterior.  The handles on top and side are very comfortable, and it has built in wheels which is going to be a real lifesaver.  The case is secured by 4 very solid catches, two of which are lockable.  Having seen the cases which come with high end Yanagisawas I can confidently say that I&#8217;d much rather have the case which came with the SB5.</p>
<p style="text-align: center;"><a href="http://proteo.me.uk/wp-content/uploads/2012/02/SB5_Case.jpg"><img class="aligncenter  wp-image-190" title="SB5_Case" src="http://proteo.me.uk/wp-content/uploads/2012/02/SB5_Case.jpg" alt="" width="480" height="360" /></a></p>
<h2 style="text-align: left;">Playing</h2>
<p>Since my sax didn&#8217;t come with a mouthpiece I had to buy one to go with it, but I&#8217;d have done this anyway.  I got a good deal on an Otto Link 7, and paired this with some Vandoren 3 reeds (at £6 a piece!, and which turned out to be way too hard).  Getting the Link mouthpiece onto the SB5 was a bit of a job.  Getting the tuning right meant putting the mouthpiece on right to the end of the cork, which is a very tight fit, even with well greased cork.  I&#8217;m still getting to grips with the setup, but having played my first gig on the instrument I feel confident enough to make some comments about the way the instrument plays.</p>
<p>Generally I&#8217;ve been very happy with the playing characteristics of the SB5.  The intonation across the whole range is excellent and I was quickly able to produce a sound I was happy with.  The notes at the top end of the lower register seem to be slightly weaker than the rest, but are still OK, and may improve as I play with the setup.  The upper register is a delight to play with a strong tone being easy to produce right up to top F#.  The only problems I&#8217;ve had have been right at the bottom end where the low notes activated by the left hand keys (C#, B and Bb) have proved to be occasionally difficult to hit and maintain (low C and bottom A are consistently good).  I suspect the reason is that I&#8217;m not pressing the keys in this group cleanly and am catching surrounding keys.  The springing on some of these low keys is quite weak, which may exacerbate this problem, but it&#8217;s probably something which will improve with more familiarity, or if not could be fixed with stronger springs.</p>
<h2>Summary</h2>
<p>Overall I am very pleased with the SB5. It has proven to be a solid and competent instrument. Realistically, it&#8217;s not the equal of something like a Yanagisawa, but it&#8217;s been a pleasure to play on and will serve my needs very well.  You&#8217;re getting an awful lot of instrument for your money and everything I&#8217;ve learned of the company would make me happy to recommend them to others, or use them again myself.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2012/02/review-of-hanson-sb5-baritone-saxophone/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The true cost of object creation in java</title>
		<link>http://proteo.me.uk/2011/12/the-true-cost-of-object-creation-in-java/</link>
		<comments>http://proteo.me.uk/2011/12/the-true-cost-of-object-creation-in-java/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 13:18:54 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[object]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=180</guid>
		<description><![CDATA[I&#8217;ve been spending some time trying to optimise the data loading part of one of my java projects.  The nature of the data we use means that we have to create hundreds of millions of objects, each of which internally stores only a single long value (it actually stores several fields packed into this value [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been spending some time trying to optimise the data loading part of one of my java projects.  The nature of the data we use means that we have to create hundreds of millions of objects, each of which internally stores only a single long value (it actually stores several fields packed into this value using a bitmask since this is more memory efficient).</p>
<p>When loading our data we are therefore parsing hundreds of millions of long values and creating the associated objects.  This can take a few minutes to complete, and having profiled the code it seems that it is the object creation which slows everything down.  I therefore did some tests to work out exactly how slow the creation of objects is relative to the primitives which exist in java.  My test code is below:</p>
<pre>public class CreateTest {

    public static void main (String [] args) {
        long start = System.currentTimeMillis();

        long [] primitives = new long[50000000];
        for (int i=0;i&lt;primitives.length;i++) {
            primitives[i] = i;
        }

        long end = System.currentTimeMillis();

        System.out.println("Making 50 million longs took "+(end-start)+"ms");

        start = System.currentTimeMillis();

        Long [] objects = new Long[50000000];
        for (int i=0;i&lt;objects.length;i++) {
            objects[i] = new Long(i);
        }

        end = System.currentTimeMillis();

        System.out.println("Making 50 million Longs took "+(end-start)+"ms");
    } 
}</pre>
<p>It&#8217;s reasonable to think that there will be an overhead for object creation, but I was surprised by the results:</p>
<p>Making 50 million longs took 199ms<br />
Making 50 million Longs took 10809ms</p>
<p>So that&#8217;s a 50-fold overhead for the object wrappers around these numbers.  What&#8217;s worse is that this overhead seems to happen inside the JVM in such a way that you can&#8217;t take advantage of multi-threading to get around it.  I tried refactoring the code to have 5 threads creating 10 million objects each, and the total runtime across 5 cores was pretty much exactly the same as doing the same thing on a single core.  This means that if you want to have 50 million objects available in your program then you&#8217;re just going to have to wait 10 seconds for them, however many cores you want to throw at the problem.</p>
<p>I also investigated other options for object creation. Namely I made my object cloneable and then used clone() to create new instances rather than calling the constructor.  The constructors for my object are very lightweight, so it was disappointing, but not surprising to see that this had no appreciable effect on the time taken for object creation.</p>
<p>I&#8217;ve even toyed with the idea of just storing these objects as an array of longs and avoiding this overhead altogether.  I could still extract the relevant data by using a set of static methods, but what I can&#8217;t then do is to sort these objects in a custom order (which I need to), since java&#8217;s built-in sort methods only accept a custom comparator for arrays of objects (which would defeat the point).</p>
<p>I&#8217;m therefore stuck with the biggest bottleneck in my program being something which I know is able to be improved by 50X (and would then make everything hugely quick), but not within the confines of the java language.</p>
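<p>A minimal sketch of the array-of-longs idea, under one strong assumption: that the required sort order can be expressed by placing the sort key in the highest bits of the packed value, so that the natural numeric order of the longs is already the order wanted.  The field layout (a 32-bit position and a 32-bit payload) and the names <em>PackedSortSketch</em>, <em>pack</em>, <em>position</em> and <em>payload</em> are hypothetical and are not taken from the project described above.  If the real ordering can&#8217;t be encoded this way then this approach doesn&#8217;t apply.</p>

```java
import java.util.Arrays;

/**
 * Illustrative only: assumes a layout with the sort key (position) in the
 * high 32 bits and an arbitrary payload in the low 32 bits.
 */
final class PackedSortSketch {

    /** Packs a non-negative position and a 32-bit payload into one long. */
    static long pack(int position, int payload) {
        if (position < 0) {
            // A negative position would set the sign bit and break the ordering.
            throw new IllegalArgumentException("position must be non-negative");
        }
        return ((long) position << 32) | (payload & 0xFFFFFFFFL);
    }

    /** Extracts the position from the high 32 bits. */
    static int position(long packed) {
        return (int) (packed >>> 32);
    }

    /** Extracts the payload from the low 32 bits. */
    static int payload(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long[] reads = { pack(300, 7), pack(12, 9), pack(300, 2), pack(45, 1) };

        // Primitive sort: no wrapper objects are created.
        // Ties on position fall back to the payload bits, compared as unsigned.
        Arrays.sort(reads);

        for (long r : reads) {
            System.out.println(position(r) + "\t" + payload(r));
        }
    }
}
```

<p>The design choice here is to move the ordering into the data layout rather than into a comparator, which keeps everything in a primitive array at the cost of fixing a single sort order when the values are packed.</p>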
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/12/the-true-cost-of-object-creation-in-java/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Moving over to Casava 1.8</title>
		<link>http://proteo.me.uk/2011/09/moving-over-to-casava-1-8/</link>
		<comments>http://proteo.me.uk/2011/09/moving-over-to-casava-1-8/#comments</comments>
		<pubDate>Fri, 16 Sep 2011 12:25:05 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Computing]]></category>
		<category><![CDATA[casava]]></category>
		<category><![CDATA[illumina]]></category>
		<category><![CDATA[sequencing]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=166</guid>
		<description><![CDATA[Introduction Illumina have recently released an updated version of their downstream analysis software CASAVA.  This is the analysis pipeline which runs after the sequencer has processed the raw data down to base call files and provides a variety of functionalities from creating usable base calls to alignment and variant calling.  Casava 1.8 makes some major [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>Illumina have recently released an updated version of their downstream analysis software CASAVA.  This is the analysis pipeline which runs after the sequencer has processed the raw data down to base call files and provides a variety of functionalities from creating usable base calls to alignment and variant calling.  Casava 1.8 makes some major changes to previous versions so this isn&#8217;t going to just be a drop in replacement for the workflow you used to have.  You&#8217;re going to have to think about this one.</p>
<p>As we&#8217;ve just got through the processing of our first run with Casava 1.8 I wanted to write up the problems (and in some cases solutions) we&#8217;d found both so I can remind myself later on and hopefully help other people who might end up in a similar situation.</p>
<h2>Our existing pipeline</h2>
<p>Our usage of the Illumina pipeline is fairly straightforward.  We use it to generate sequence with qualities and in most cases we use Eland to map to a reference genome.  We haven&#8217;t done any variant calling through the pipeline and RNA-Seq mapping was done externally with tophat rather than with Illumina&#8217;s own tools.  We have a LIMS system which manages our samples for us and we link into the run folders created by the sequencer to present data back to the user.</p>
<h2>A long long time ago&#8230;</h2>
<p>In older versions of the pipeline there was a strictly hierarchical set of programs which created a structured set of folders which matched the order in which the programs were run.  Each folder was named after the program which created it and the time it was run.  The order of execution was:</p>
<p>Images (tiff) &gt; [firecrest] &gt; Intensities (cif) &gt; [Bustard] &gt; Base Calls (qseq) &gt; [Gerald] &gt; Filtered Base Calls (fastq) and mapped data (sorted)</p>
<p>In more recent times much of this processing moved into SCS on the sequencer itself so the pipeline actually started at base calls, leaving only a small script to create qseq files and then run Gerald as normal.  You could still run all or part of the full pipeline if you wanted to.</p>
<p>There were good and bad things about this pipeline.  On the plus side you had full control over what was run, and it was easy to tell when looking through a run folder exactly what had been done simply by looking at the directory structure.  Configuration was achieved mostly within a single gerald config file which said what type of analysis you wanted to run and provided details of the reference genome to use.  This config file could be passed to any stage in the pipeline and would pass down the levels so that with a single command you could run the whole pipeline from images to mapped data and be confident it was all going to work.</p>
<p>On the bad side this pipeline got quite clunky if you had to demultiplex your samples.  There were also issues with file types &#8211; Illumina made their own variant of fastq files which used a different quality encoding to everyone else (two different variants actually), and they also used a non-standard format for their alignments which required a non-trivial conversion to a more standard format for some downstream applications.  Also the final results files were named solely after the lane in which they were sequenced, so you couldn&#8217;t tell from the file name alone which run a file had come from, which was a pain when mixing files from different runs in the same analysis.  Finally the output files tended to be uncompressed text files, which as outputs got bigger (especially with the advent of the HiSeq) got very big and unwieldy.</p>
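<p>To illustrate the quality encoding difference, here is a minimal sketch of the simpler of the two conversions.  It assumes the quality string holds Phred scores stored with an ASCII offset of 64 (the later Illumina variant) and rewrites them with the offset of 33 used by Sanger fastq.  The older Solexa variant stores log-odds scores, which can be negative and need a non-linear conversion, so this sketch rejects them rather than handling them.  The class and method names are made up for the example.</p>

```java
/** Illustrative only: re-encodes Phred+64 quality strings as Phred+33. */
final class QualityRecode {

    private static final int ILLUMINA_OFFSET = 64; // assumed input encoding
    private static final int SANGER_OFFSET = 33;   // standard fastq encoding

    /** Converts one quality string from Phred+64 to Phred+33. */
    static String phred64ToPhred33(String qualities) {
        StringBuilder out = new StringBuilder(qualities.length());
        for (int i = 0; i < qualities.length(); i++) {
            int score = qualities.charAt(i) - ILLUMINA_OFFSET;
            if (score < 0) {
                // Characters below '@' cannot be Phred+64 scores; the data is
                // probably Solexa-scaled or already Sanger encoded.
                throw new IllegalArgumentException(
                        "Not a Phred+64 quality character: " + qualities.charAt(i));
            }
            out.append((char) (score + SANGER_OFFSET));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(phred64ToPhred33("hhhgB"));
    }
}
```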
<p>Having said all of this, the system worked and had remained largely unchanged since the dawn of the GAII, albeit with different sections moving around between the pipeline server to the IPAR (remember those?), and the control PC.</p>
<h2>A new hope&#8230;</h2>
<p>Casava 1.8 was Illumina&#8217;s attempt to address some of the problems which had arisen in older versions.  It wasn&#8217;t just a tweak on the existing pipeline but rather a major update which fundamentally changed the way in which the pipeline was run and the types and locations of the output files.  The update had been planned for many months, with Illumina <a href="http://futo.cs.yale.edu/mw/images/6/68/CASAVA1_8_Changes.pdf">putting out a list of proposed changes</a> for comment and <a href="http://seqanswers.com/forums/showthread.php?t=8895">responding to questions about these</a> well before the release of the new pipeline.  The major changes were to be:</p>
<ol>
<li>The new pipeline would be based around standard file formats (fastq with sanger encoding, and BAM for alignments)</li>
<li>The pipeline would have much improved support for barcoded samples</li>
<li>The pipeline would allow sample annotations to be used to identify samples and help with data management</li>
</ol>
<p>Concerns were raised about some of these proposals, but reassuring noises were made by Illumina who assured everyone that people with existing workflows would be able to transition smoothly to the new system, which would also make new analyses easier to run.</p>
<h2>The Empire Strikes Back&#8230;</h2>
<p>So now that Casava 1.8 has been released how much does this new pipeline make the life of the bioinformatician easier or harder?  Since we&#8217;ve been working on our first Casava run I&#8217;ll go through the changes we&#8217;ve had to make and the problems we&#8217;ve hit.  It is not at all unlikely that there are simple workarounds for some of our problems, but I&#8217;ve spent a long time going through the documentation and if there is an easy way around them then I haven&#8217;t yet found it.</p>
<h3>Config files</h3>
<p>One of the major annoyances in the new version is the lack of consistency in the config files required to process a run.  Up until now there was only one config file (gerald.conf) which laid out the type of analysis required, either globally or on a per-lane basis.  With the new version there are now two separate config files required along with many more command line options.</p>
<p>In the new casava the gerald.conf file is still present (although now called config.txt) with substantially the same options as before, but this is now supplemented by a compulsory sample sheet csv file which must be provided when processing of the run starts.  The sample sheet allows you to specify a flowcell id, sample name, description and associated project name along with any barcode sequences, a reference genome, sequencing recipe, operator and a flag to say if the sample is a control.  You must provide a sample sheet to complete base calling and the sample names and project names are used to name the output files from base calling.  Any listed barcodes are used to split the sequences into their respective subgroups.</p>
<p>There are a couple of problems here.  Firstly there&#8217;s nothing in the documentation which says which fields in the sample sheet are required and which are optional.  It appears that the only things you actually need are the lane, sample name and project name fields.  You can specify a flowcell id but it&#8217;s not validated against the run you&#8217;re processing.  You can specify a reference genome, but it&#8217;s not actually used when you do an alignment &#8211; you have to specify it again when you write your gerald.conf file.  The other fields seem to only be there for your records and not because the pipeline uses them.</p>
<p>If, like us, you already have a LIMS system for managing your samples then having to write out a blank sample sheet we don&#8217;t need is an annoyance we could do without.  Also, since our LIMS needs to automatically recognise the output files from a run folder, this is substantially harder to do when there is no standard default blank sample sheet.  It&#8217;s going to take a lot more parsing to find the files associated with a given lane than in the previous release, although since either the prefix or suffix of the files involved is controlled this is still possible.</p>
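<p>As a rough sketch of what generating such a placeholder sample sheet might look like, the code below fills in only the lane, sample name and project name (plus an optional flowcell id) and leaves everything else empty.  The column header and column order are an assumption based on the fields listed above and should be checked against the Casava documentation before use; the class name, the <em>build</em> method and the &#8220;lane1&#8221;-style sample names are invented for the example.</p>

```java
/** Illustrative only: builds a placeholder sample sheet with one row per lane. */
final class BlankSampleSheet {

    // Assumed header; verify the exact names and order against the Casava docs.
    static final String HEADER =
            "FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject";

    /** Returns CSV text with a header line followed by one line per lane. */
    static String build(String flowcellId, int lanes, String project) {
        StringBuilder sheet = new StringBuilder(HEADER).append('\n');
        for (int lane = 1; lane <= lanes; lane++) {
            sheet.append(String.join(",",
                    flowcellId,            // FCID (not validated by the pipeline)
                    String.valueOf(lane),  // Lane
                    "lane" + lane,         // SampleID (hypothetical naming scheme)
                    "",                    // SampleRef
                    "",                    // Index (no barcodes)
                    "",                    // Description
                    "",                    // Control
                    "",                    // Recipe
                    "",                    // Operator
                    project));             // SampleProject
            sheet.append('\n');
        }
        return sheet.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("FC_PLACEHOLDER", 8, "default_project"));
    }
}
```

<p>Because the sample and project names end up in the output file names, using a fixed naming scheme like this would at least keep those names predictable for whatever has to find the files afterwards.</p>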
<p>Next there is the problem of configuration.  In the old system there was a single configuration file which you presented at the top of your analysis stack and which passed down to all levels.  In the new system it&#8217;s much more of a mess.</p>
<ul>
<li>For base calling you need to provide a sample sheet.</li>
<li>For alignment you need to provide a gerald conf file.</li>
<li>For BAM creation and variant calling you need to specify the reference genome on the command line and do this separately for every sample.</li>
</ul>
<p>That&#8217;s 3 different places you can set the reference genome, only 2 of which are ever used!  The reference genome now also needs to be a set of fasta files which need to be in files ending with .fa (mine ended in .txt but the docs don&#8217;t say that, you just work it out when you get an error when you try to use them).</p>
<p>Whereas the transition from one stage of the pipeline to the next used to be handled via a recursive make, the new system now works by letting you specify a command to run after the current analysis has successfully completed.  This is in some ways more flexible as it allows you to insert your own custom scripts into the workflow, but it&#8217;s also a real pain to work out what set of commands you&#8217;re going to need to run to get to the very end of the pipeline, especially as these are going to reference folder names which don&#8217;t yet exist, and whose names are dynamically generated from the data in the sample sheet.  Add to this that BAM file creation is now a per-sample process and this means that configuring a full run is now a major undertaking, and one which is very fragile since the pipeline itself cannot validate the options for anything beyond the initial analysis step.  Unless you have a completely standard analysis then this is going to put an end to running stuff over the weekend and being sure that it will have finished successfully when you come back on Monday!</p>
<h3>Output files</h3>
<p>The move to use standard formats for output files should have made life easier for users of the pipeline, but in practice it&#8217;s actually harder to work with the new system than the old one.</p>
<p>One change which has been introduced, probably due to the number of sequences generated by hi-seq machines, is that you now get a set of output files rather than a single file for all of the analyses.  This was sometimes the case for internal formats in previous releases (eg qseq files), but now it applies to some of the final outputs.  The sets of files have a common start and end with a grouping number which lets you know how to recombine them.  So, for a GAIIx lane you might expect to get up to 8 gzipped fastq files for each lane.  You can pass an option to increase the number of sequences in each file, but if you make that limit over 16 million then it breaks eland, so you&#8217;re really stuck with multiple output files.  This same problem affects eland output as well with multiple gzipped export files.</p>
<p>Another problem with the fastq output is that unlike older versions of the pipeline the fastq files in Casava 1.8 are not filtered, so you&#8217;ll have an entry for every cluster on your flowcell.  A flag in the header tells you if your sequence should be filtered, but none of the downstream programs will understand that, so you&#8217;re going to have to do it yourself, which takes away the point of having fastq files assembled for you already!</p>
<p>Illumina do acknowledge both of these problems in the documentation, and provide documented workarounds.  Unfortunately both of the workarounds they provide are broken &#8211; they say that you can simply cat the gzipped fastq files together to combine them, but unfortunately this puts gzip headers in the middle of your file.  Some decompressors (<a href="http://www.manpagez.com/man/1/gzip/">gzip itself for example</a>) can understand this and will extract all of the data, but others (the standard java <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425">gzip extractor for example</a>) do not, and just stop extracting data (with no warnings or errors) after the end of the first compressed section.  There is also a somewhat complex grep statement provided which should filter out flagged sequences to produce a fastq file, but this <a href="http://seqanswers.com/forums/showpost.php?p=48864&amp;postcount=70">was reported</a> to create broken fastq files from some inputs, and leaves you with uncompressed data.</p>
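<p>The multi-member gzip problem is easy to reproduce.  The Python sketch below compresses two tiny made-up fastq records separately and joins the compressed bytes, which is what cat does to the .gz files.  Python&#8217;s gzip module reads across the member boundary, while a single zlib decompression object is used here as a stand-in for a reader that stops at the end of the first member.  It only mimics the behaviour described above and is not the java implementation.</p>

```python
import gzip
import zlib

# Two small FASTQ chunks standing in for the per-lane output files.
part_one = b"@read1\nACGT\n+\nIIII\n"
part_two = b"@read2\nTTGA\n+\nIIII\n"

# Compress each chunk on its own, then join the compressed bytes.
# The result contains two gzip members, each with its own header.
combined = gzip.compress(part_one) + gzip.compress(part_two)

# gzip.decompress keeps going across member boundaries,
# so both records come back.
recovered = gzip.decompress(combined)

# A single decompression object handles one gzip member (wbits=31 selects
# the gzip container) and leaves everything after it in unused_data.
single_member = zlib.decompressobj(31)
first_only = single_member.decompress(combined)
leftover = single_member.unused_data
```

<p>Here recovered holds both records, whereas first_only holds just the first one and leftover still contains the whole of the second compressed member, with nothing raised to tell you that data was skipped.</p>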
<p>We ended up writing our own short Perl script which filtered, combined and re-compressed the fastq files but this (well the compressing part) is a slow process which we&#8217;re currently performing manually.</p>
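<p>Our script was written in Perl and isn&#8217;t reproduced here, but the same idea can be sketched in Python.  The example assumes strict four-line fastq records and a header whose second space-separated part looks like read:filter_flag:control:index, with Y marking a read which should be filtered.  The function names are made up for the illustration.</p>

```python
import gzip


def is_filtered(header):
    """True when a Casava 1.8 style header carries the Y (filtered) flag.

    Assumes the layout '@id-fields read:filter_flag:control:index'."""
    fields = header.split(" ", 1)
    if len(fields) != 2:
        return False
    flags = fields[1].split(":")
    return flags[1:2] == ["Y"]


def filter_and_combine(input_paths, output_path):
    """Copy the records which passed filtering from several gzipped FASTQ
    files into one gzipped file.  Returns (kept, dropped) counts."""
    kept = dropped = 0
    with gzip.open(output_path, "wt") as out:
        for path in input_paths:
            with gzip.open(path, "rt") as handle:
                while True:
                    # Assumes every record is exactly four lines long.
                    record = [handle.readline() for _ in range(4)]
                    if not record[0]:
                        break
                    if is_filtered(record[0]):
                        dropped += 1
                    else:
                        out.writelines(record)
                        kept += 1
    return kept, dropped
```

<p>Because the output is written through a single gzip stream the combined file has only one gzip member, so it avoids the concatenation problem described above.  The compression is still the slow part.</p>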
<p>The other standard output format used in the new version is BAM for aligned output.  What wasn&#8217;t said though is that this is not the default output from eland (which is still export files, albeit now gzipped).  Creating BAM output requires that you run part of the variant calling pipeline, using the sort and bam targets to turn your export files into a bam.  There is a stand alone script to do this without going through variant calling, but this only produces SAM files.  There is also a --bam option for Eland, but you can only set this (according to the docs at any rate) if you run Eland as a stand alone program, and not by running the actual pipeline.  The documentation for this part of the pipeline is tortuous with at least 3 different ways to create BAM files all with their own options and requirements.</p>
<p>When you do finally create a BAM you find that in order to do this you must create an arbitrarily named directory (so no chance of finding it with our LIMS), which contains a completely generically named bam file (sorted.bam), so really no way to associate this robustly with a sample, along with separate bam files for every chromosome in your reference genome, even if you&#8217;re not doing any other analysis.  All to get the mapped data in an output you can use with any other downstream application.</p>
<p>Finally, the base calls, alignments and variant calls are now not stored in a structured hierarchy under the data folder, but get their own top level folders inside the run folder (base calls go in &#8216;Unaligned&#8217;, export files go in &#8216;Aligned&#8217; and BAM files go wherever you tell them, with no defaults).  What this means in effect is that you can only run one set of base calls and alignments if you need to be able to reliably predict where your output will go.  Subsequent runs will fail if the output folder already exists.  I realise many people won&#8217;t ever bother reprocessing their data, but we do this all the time and this change is going to limit what we can do.  We can work around it locally, but we also distribute a <a href="http://www.bioinformatics.bbsrc.ac.uk/projects/sierra/">LIMS</a> which has to find these files on other sites and this gets to be impractical in the new system (and impossible in the case of variant calling output).</p>
<h2>The bioinformatician strikes back&#8230;</h2>
<p>So what is the outcome of all of this?</p>
<p>In our case the outcome is that we&#8217;re going to:</p>
<ol>
<li>Use a standard default sample sheet for all runs so we at least get consistent directory naming</li>
<li>Rewrite our LIMS to be able to recognise the new folder structure generated from our standard sample sheet</li>
<li>Write a script to filter and combine the multiple fastq files into a single file we can give back to our users</li>
<li>Stop using Eland for our mapping.  It&#8217;s just too much of a pain to put this into the pipeline and link the results with our LIMS. We were using bowtie and tophat for secondary mappings anyway but these will take over as our primary mapping pipeline.</li>
<li>Work out a nasty hack to allow our LIMS to recognise reprocessed folders &#8211; which won&#8217;t work on any site other than ours.</li>
</ol>
<p>All in all, for our use case this update has made our pipeline significantly less functional than it used to be.  I can see that there are probably people who would benefit from some of the changes here, but they are far from being the promised panacea.  I&#8217;d be interested to know how many groups are actually doing any downstream analysis in Casava, as opposed to stopping after base calling and using other options.  I got used to getting funny looks when I said we were using Eland, and I suspect these changes will make the users of Casava for analysis an ever rarer breed.</p>
<p>With luck some or all of the problems we&#8217;ve experienced will be rectified in future releases.  But I think once we&#8217;ve moved away from doing anything beyond base calling in Casava then I don&#8217;t see us moving back.</p>
<h2>Addenda&#8230;</h2>
<p>Since posting this originally I&#8217;ve received some feedback which might be useful.</p>
<p>It appears that you <a href="http://seqanswers.com/forums/showthread.php?p=51576">can run Casava without a sample sheet</a> at all, even though the documentation makes no mention of this, and actually says it will read one from a default location if one isn&#8217;t specified.  The output directories are named after the flowcell id for the project name, and the lane number for the samples &#8211; which is way better than using a minimal default sample sheet.</p>
<p>In a <a href="http://seqanswers.com/forums/showthread.php?p=52126">more recent update</a> Illumina say they are intending to release an update to Casava (1.8.1) which will change the behaviour of the BCL converter to not include reads which failed QC by default.  They will also provide an optional flag to write all of the sequence output into a single file.  This should solve our problems with this part of the pipeline.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/09/moving-over-to-casava-1-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Importing RNA-Seq data into SeqMonk</title>
		<link>http://proteo.me.uk/2011/09/importing-rna-seq-data-into-seqmonk/</link>
		<comments>http://proteo.me.uk/2011/09/importing-rna-seq-data-into-seqmonk/#comments</comments>
		<pubDate>Sun, 04 Sep 2011 16:12:19 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[rnaseq]]></category>
		<category><![CDATA[SeqMonk]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=154</guid>
		<description><![CDATA[Introduction Mapped RNA-Seq data coming from eukaryotes is probably the most complicated data type to import into SeqMonk due to it&#8217;s relative complexity and the abundance of options with which you are presented.  Depending on exactly what sort of information you want to know about your data different data import options will be useful, so [...]]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p>Mapped RNA-Seq data coming from eukaryotes is probably the most complicated data type to import into SeqMonk due to its relative complexity and the abundance of options with which you are presented.  Depending on exactly what sort of information you want to know about your data different data import options will be useful, so understanding the best way to import your data can be important.</p>
<h1>Options</h1>
<p>Virtually all RNA-Seq data will start out as a BAM / SAM file since this is the only format currently capable of describing mapped positions which span one or more splice junctions.  The options for import are therefore the options found in the BAM / SAM parser.  The options you have are:</p>
<ul>
<li>Is the data single or paired end</li>
<li>Do you want to split the reads into spliced segments</li>
<li>Do you want to import introns instead of exons</li>
</ul>
<p>Each combination of options will produce a different set of aligned reads and will be useful for different types of analysis.  The figure below summarises the reads which are imported from a set of spliced paired end mapped data.  The colours represent alignment direction (red=forward, blue=reverse).</p>
<h2><a href="http://proteo.me.uk/wp-content/uploads/2011/09/rna_seq.png"><img class="aligncenter size-medium wp-image-155" title="RNA Seq Import Options" src="http://proteo.me.uk/wp-content/uploads/2011/09/rna_seq-300x142.png" alt="" width="300" height="142" /></a></h2>
<h2>Choosing the correct options</h2>
<p>The options you choose will depend on what you want to do with your data.  Typically you want to do one of three things:</p>
<ol>
<li>Quantitatively analyse the expression of each transcript</li>
<li>Quantitatively analyse splice variation</li>
<li>Identify novel transcripts</li>
</ol>
<p>If you want to quantitatively analyse expression then you need to split your reads so that separate reads are recorded for each mapped splice section.  You should then combine this with the base pair quantitation so that each spliced read is counted proportionally in the different exons it spans.  If you have paired end data then selecting paired end import will simply reverse the orientation of any reads coming from the second read in a pair.  If your data is not directional then this won&#8217;t make any difference, but if you have a directional library this will ensure that the correct orientation is assigned to each read.</p>
<p>If your interest is in the analysis of splice variants then the easiest way to do this is to import only the introns from your mapped data. Introns will only be imported from a read which contains an internal splice site; reads which do not splice will be ignored.  Once the introns have been imported you should use the read position probe generator to put a probe over every different splice junction combination, and then quantitate these with the exact overlap quantitation to get an exact count for each splice combination seen in the data.  If you have paired end data then it&#8217;s best to import introns as paired end data so that splice junctions from the second read match exactly (position and strand) with those from the first read.</p>
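<p>To make the difference between the split and intron imports concrete, here is a small Python sketch of how a spliced alignment can be broken into its exon segments and introns from the CIGAR string in a BAM / SAM record.  It is a generic illustration of the idea under the usual SAM conventions (N marks a skipped region on the reference) and is not taken from the SeqMonk source; the function name is made up.</p>

```python
import re

# One CIGAR operation: a length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_spliced_read(start, cigar):
    """Return (exons, introns) as lists of 1-based inclusive (start, end)
    positions for a read whose leftmost aligned base is at start."""
    exons, introns = [], []
    position = start       # next reference base to be consumed
    block_start = None     # start of the exon block being built
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            # These operations consume reference bases inside an exon block.
            if block_start is None:
                block_start = position
            position += length
        elif op == "N":
            # A skipped region closes the current block and records an intron.
            if block_start is not None:
                exons.append((block_start, position - 1))
                block_start = None
            introns.append((position, position + length - 1))
            position += length
        # I, S, H and P do not consume reference bases, so are ignored here.
    if block_start is not None:
        exons.append((block_start, position - 1))
    return exons, introns
```

<p>For a read starting at position 100 with the CIGAR string 50M1000N25M this returns two exon segments, (100, 149) and (1150, 1174), and one intron, (150, 1149).  A split import corresponds to keeping the first list and an intron import to keeping the second.</p>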
<p>Finally, if your interest is in seeing the extent of novel transcripts then you can import spliced mapped data as a normal BAM / SAM file.  Since the mapped region is not expected to be contiguous on the genome you will need to greatly increase the filter for largest allowed distance between the ends of the reads to avoid rejecting reads which span introns.  This type of import will join the ends of the read (or reads for paired end data) so you get the most complete view of the region of the genome spanned by your reads, but the quantitative influence of each read will depend on the size of the introns spanned, so importing data this way should only be used as a qualitative way of viewing your data.  Combining this view of your transcriptome with a more conventional spliced import in a different track should give you the most complete view of the position and extent of each of your transcripts.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/09/importing-rna-seq-data-into-seqmonk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting the java heap size you asked for</title>
		<link>http://proteo.me.uk/2011/08/getting-the-java-heap-size-you-asked-for/</link>
		<comments>http://proteo.me.uk/2011/08/getting-the-java-heap-size-you-asked-for/#comments</comments>
		<pubDate>Fri, 26 Aug 2011 13:00:04 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[hack]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=145</guid>
		<description><![CDATA[In a recent post I discussed a method we&#8217;re using for automatically setting the java heap size appropriately at runtime. It now turns out that the issue of setting the heap size is complicated by the fact that the heap size you request on the command line isn&#8217;t necessarily what you get given. In some [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://proteo.me.uk/2011/07/dynamically-setting-the-java-heap-size-at-runtime/">recent post</a> I discussed a method we&#8217;re using for automatically setting the java heap size appropriately at runtime. It now turns out that the issue of setting the heap size is complicated by the fact that the heap size you request on the command line isn&#8217;t necessarily what you get given. In some cases the differences are modest, but sometimes they can be significant &#8211; amounting to hundreds of megabytes of discrepancy.</p>
<p>The simple test I did was to compare the heap size requested by setting the -Xmx value on the java command with the actual amount of available memory as reported by Runtime.getRuntime().maxMemory().  What I found was that the relationship between these two values isn&#8217;t 1:1, isn&#8217;t fixed at a given ratio, and is platform (and indeed VM) dependent.</p>
<p>According to <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4391499">this bug report</a> the actual implementation of -Xmx is VM-dependent, so that the value you supply on the command line is merely a suggestion to the VM and it&#8217;s free to do whatever it likes.  Because I&#8217;d like my software to work consistently on all platforms I therefore had a look at what the different VMs actually do.</p>
<p style="text-align: center;"><a href="http://proteo.me.uk/wp-content/uploads/2011/08/heap_size_graph.png"><img class="aligncenter size-medium wp-image-146" title="Relationship between requested heap size and actual heap size" src="http://proteo.me.uk/wp-content/uploads/2011/08/heap_size_graph-300x193.png" alt="" width="300" height="193" /></a></p>
<p>The OSX VM actually stays very close to the requested amount of memory across the whole range of requested heap sizes.  The linux and windows VMs though overcommit at small heap sizes (there seems to be a minimum allowed heap size of ~10MB), but undercommit by up to 12% at larger heap sizes.  When you&#8217;re requesting a heap of several gigabytes in size a 12% loss is a significant amount of memory.</p>
<p>Our immediate solution to this problem is to do a trial run where we launch a small program which reports the actual heap size allocated.  We then relaunch the normal java command, increasing the heap request size by a correction factor calculated from the trial run.  This seems to produce consistent results on all platforms and gives us what we asked for in the first place.</p>
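<p>The arithmetic behind the correction can be sketched in a few lines of Python.  This assumes the simplest possible correction, a plain ratio between what the trial run requested and what the VM reported back, which is an assumption made for the example rather than a description of our launcher.  Since the relationship between the two values isn&#8217;t a fixed ratio, the trial is best run at the heap size you actually want.</p>

```python
import math


def corrected_request(wanted_mb, trial_request_mb, trial_actual_mb):
    """Scale the -Xmx request so that the heap actually granted should
    come out close to wanted_mb.

    trial_request_mb is the value the trial run asked for and
    trial_actual_mb is what the VM reported back, for example from
    Runtime.getRuntime().maxMemory() converted to megabytes."""
    factor = trial_request_mb / trial_actual_mb
    # Round up so that we never ask for less than we need.
    return math.ceil(wanted_mb * factor)
```

<p>With a trial run which asked for 4000MB and was given 3520MB (a 12% shortfall), corrected_request(4000, 4000, 3520) comes out at 4546, so the relaunch would use -Xmx4546m.</p>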
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/08/getting-the-java-heap-size-you-asked-for/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mac application bundle caching</title>
		<link>http://proteo.me.uk/2011/08/mac-application-bundle-caching/</link>
		<comments>http://proteo.me.uk/2011/08/mac-application-bundle-caching/#comments</comments>
		<pubDate>Wed, 10 Aug 2011 19:33:04 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[OSX]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=142</guid>
		<description><![CDATA[Having spent a frustrating hour or so trying to update a mac application bundle I thought I&#8217;d share a couple of things which caused no end of confusion and aren&#8217;t what you&#8217;d expect and are therefore likely to catch out those working with application bundles for the first time. Basically I was finding that although [...]]]></description>
			<content:encoded><![CDATA[<p>Having spent a frustrating hour or so trying to update a mac application bundle I thought I&#8217;d share a couple of things which caused no end of confusion and aren&#8217;t what you&#8217;d expect and are therefore likely to catch out those working with application bundles for the first time.</p>
<p>Basically I was finding that although I was making changes to my bundle these changes weren&#8217;t actually being applied.  The bundle seemed to hold on to older versions of the data in the bundle even if this had been changed or even deleted.</p>
<p>It turns out that OSX caches information from the application bundle and then doesn&#8217;t check to see if the information in the bundle has been updated unless it is forced to.  There are two separate caches which can affect you, a cache of the bundle icons, and a cache of the Info.plist file.</p>
<h2>Icon cache</h2>
<p>Inside your application bundle you can create a file containing a series of icons for your application.  You make this file from a series of individual bitmap files normally by using the icon composer application which comes with the developer tools in OSX.  What you will find is that if you update your icns file containing your application icons you won&#8217;t actually see any difference in the icons which are shown in the finder.  The reason for this is that OSX caches the icons associated with files in a folder in a hidden file called .DS_Store.  If this file is present then the finder will read the icon data out of that even if the icns file in your application bundle has been updated.</p>
<p>If you want to force the finder to recalculate the .DS_Store file then you need to do the following:</p>
<ol>
<li>Make sure you have closed all finder windows showing your folder of interest</li>
<li>From within the terminal delete the .DS_Store file from your folder of interest</li>
<li>Relaunch the finder using Apple &gt; Force Quit &gt; Finder &gt; Relaunch (this won&#8217;t affect any of your running applications)</li>
</ol>
<p>This will force the finder to recalculate the icon cache for your folder and any changes to your icns file will be visible.</p>
<h2>Info.plist cache</h2>
<p>The Info.plist file in your application bundle provides all of the basic information about how to launch your application, where to find the icon file and various settings relating to the application&#8217;s name and version.  As with the application icons the information in the Info.plist file is cached by the OS, such that if you change the data in the file (say you change the name of the executable you want to run), then the bundle will remember the old value not the new one.</p>
<p>As an aside, if you&#8217;re having problems running an application bundle, the easiest way to see what&#8217;s actually going wrong is to open the Applications &gt; Utilities &gt; Console application and then relaunch your application.  Any output generated on stdout or stderr by an application bundle will show up in this application and you can see any error messages which are generated.</p>
<p>Clearing the Info.plist cache is a bit easier than clearing the icon cache.  The quickest way to reset it is to simply move your application to a different folder (say your desktop), and then move it back.  This will reset the Info.plist cache and your changes should be applied.</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/08/mac-application-bundle-caching/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dynamically setting the java heap size at runtime</title>
		<link>http://proteo.me.uk/2011/07/dynamically-setting-the-java-heap-size-at-runtime/</link>
		<comments>http://proteo.me.uk/2011/07/dynamically-setting-the-java-heap-size-at-runtime/#comments</comments>
		<pubDate>Fri, 29 Jul 2011 19:57:54 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=133</guid>
		<description><![CDATA[One of the oddities about java programs is that they require you to set a maximum heap size when you start the program. What this means in effect is that you need to be able to predict the memory usage of your program before it starts, and whatever heap size you set needs to be [...]]]></description>
			<content:encoded><![CDATA[<p>One of the oddities about java programs is that they require you to set a maximum heap size when you start the program. What this means in effect is that you need to be able to predict the memory usage of your program before it starts, and whatever heap size you set needs to be appropriate for all of the systems the program is going to run on and all of the datasets it will handle.</p>
<p>When you&#8217;re distributing a <a href="http://www.bioinformatics.bbsrc.ac.uk/projects/seqmonk/">desktop application which needs to process tens of gigabytes of data</a> this can be a problem.  Ideally you&#8217;d like to set a heap size of a few gigabytes to give yourself enough overhead to process even the largest of datasets.  However not all machines have that much RAM installed, and even if they do they require a 64 bit OS to be able to use more than 2GB of it on a single process.</p>
<p>Up until now we&#8217;ve resorted to setting a lowest common denominator heap size (1500M), and providing instructions for increasing this on systems which can handle it.  This is however very inelegant and means we have to warn users if they&#8217;re running out of memory and make them save, reconfigure and restart the application.</p>
<p>We have now moved over to using a system which dynamically sets the heap size to an appropriate value at runtime.  We do this by writing a Perl wrapper which launches the java application after having worked out the most appropriate heap size for the current system.</p>
<p>To do this we have to work out:</p>
<ol>
<li>Whether we have a 32 or 64 bit JRE to work with</li>
<li>The amount of physical RAM in the machine</li>
</ol>
<p>The heap size is set as 2/3 of the physical RAM which leaves enough overhead for the JRE and basic system processes. In addition we set a ceiling on the heap size.  For 32 bit systems this is 1500M which is the most you can practically use given the 2GB per process limit (you have to leave something for the JRE itself). For 64 bit systems we set the ceiling at 6GB. It&#8217;s not in our interest to set the heap size too high as this ends up resulting in long freezes during garbage collection, so we set it to the largest size we&#8217;re practically going to need.  We work out if we&#8217;re on a 64-bit system by parsing the output of java -version (it doesn&#8217;t matter if the OS is 64 bit if the JRE is still 32 bit).</p>
<p>Finding the amount of physical RAM is a platform specific task. On windows we have to use the Win32 API. Under linux we parse the output of &#8216;free&#8217;, and on OSX and the BSDs we parse the output of top.</p>
<p>If the user doesn&#8217;t like our auto-configured heap size we allow them to override it by passing a -m argument to the wrapper script.</p>
<p>For unix-like OSs we don&#8217;t need to worry about perl being present, but for windows we compile the script into a windows binary using <a href="http://search.cpan.org/dist/PAR-Packer/">pp</a>.  The <a href="http://search.cpan.org/~cosimo/Win32-API-0.63/API.pm">Win32::API module</a> is bundled into this binary.  On other platforms we don&#8217;t need to distribute this since it&#8217;s loaded dynamically at runtime only if perl&#8217;s <a href="http://alma.ch/perl/perloses.htm">$^O variable</a> tells us we&#8217;re running under windows.  Under OSX we can run the wrapper nicely from the command line.  We&#8217;re still working out the best way to include this as part of an application bundle.</p>
<p>This isn&#8217;t perhaps the cleanest of solutions, but compared to the very manual process we had before it&#8217;s a lot easier for the users to get their systems set up optimally.</p>
<pre>#!/usr/bin/perl
use warnings;
use strict;
use English;
use FindBin qw($RealBin);
use Getopt::Long;
use IPC::Open3;

my @java_args;</pre>
<pre># See if they manually set a heap size
my $memory;

my $result = GetOptions(
                        'memory=i' =&gt; \$memory,
                         );

if ($memory) {
    if ($memory &lt; 500) {
        die "Memory allocation must be at least 500M";
    }
}
else {
    $memory = determine_optimal_memory();
}
unshift @java_args,"-Xmx${memory}m";

exec "java",@java_args, "uk.ac.bbsrc.babraham.SeqMonk.SeqMonkApplication";

sub print_error {

    # We wrap errors like this so we can keep a windows shell window open
    # so the user can see any errors we generate

    my ($error) = @_;

    warn $error;

    $_ = &lt;STDIN&gt;;

    exit 1;
}

sub determine_optimal_memory {

    # We'll set a ceiling for the memory allocation.  On a 32-bit OS this is going
    # to be 1500m (the max it can safely handle), on a 64-bit OS we won't take more
    # than 6GB
    my $max_memory = 1500;

    # We need not only a 64 bit OS but 64 bit java as well. It's easiest to just test
    # java since the OS support must be there if you have a 64 bit JRE.

    my ($in,$out);
    open3(\*IN,\*OUT,\*OUT,"java -version") or print_error("Can't find java");
    close IN;
    while (&lt;OUT&gt;) {
        if (/64-Bit/) {
            $max_memory = 6000;
        }
    }
    close OUT;

    warn "Memory ceiling is $max_memory\n";

    # The way we determine the amount of physical memory is OS dependent.
    my $os = $^O;

    my $physical;
    if ($os =~ /Win/) {
        $physical = get_windows_memory($max_memory);
    }
    elsif ($os =~/darwin/ or $os =~ /bsd/i) {
        $physical = get_osx_memory($max_memory);
    }
    else {
        $physical = get_linux_memory($max_memory);
    }

    warn "Raw physical memory is $physical\n";

    # We then set the memory to be the minimum of 2/3 of the physical
    # memory or the ceiling, whichever is lower.
    $physical = int(($physical/3)*2);

    if ($max_memory &lt; $physical) {
        return $max_memory;
    }

    warn "Using $physical MB of RAM to launch seqmonk\n";
    return $physical;

}

sub get_linux_memory {
    # We get the amount of physical memory on linux by parsing the output of free

    open (MEM,"free -m |") or print_error("Can't launch free on linux: $!");

    while (&lt;MEM&gt;) {
        if (/^Mem:\s+(\d+)/) {
            return $1;
        }
    }

    close MEM;

    print_error("Couldn't parse physical memory from the output of free");
}

sub get_osx_memory {

    # We get the amount of physical memory on OSX by parsing the output of top

    open (MEM,"top -l 1 -n 0 |") or print_error("Can't get amount of memory on OSX: $!");

    my $total_mem = 0;

    while (&lt;MEM&gt;) {
        if (/^PhysMem:.*?(\d+)M\s+used,\s+(\d+)M\s+free/) {
            $total_mem += $1;
            $total_mem += $2;
        }    
    }

    close MEM;

    unless ($total_mem) {
        print_error("Couldn't parse physical memory from the output of top");
    }

    return $total_mem;

}

sub get_windows_memory {

    warn "Getting windows physical memory\n";

    # This code was adapted from an answer posted by Tom Feiner on
    # stackoverflow
    #
    # http://stackoverflow.com/questions/423797/how-can-i-find-the-exact-amount-of-physical-memory-on-windows-x86-32bit-using-per

    my ($max_memory) = @_;

    eval {
        require Win32::OLE;
        Win32::OLE-&gt;import(qw( EVENTS HRESULT in ));
        1;
    } or do {
        print_error("Couldn't load Win32 module to determine windows memory");
    };

    my $WMI = Win32::OLE-&gt;GetObject( "winmgmts:{impersonationLevel=impersonate,(security)}//./" ) || print_error ("Could not get Win32 object: $OS_ERROR");
    my $total_capacity = 0;

    foreach my $object (in($WMI-&gt;InstancesOf( 'Win32_PhysicalMemory' ))) {
        $total_capacity += $object-&gt;{Capacity};
    }

    my $total_capacity_in_mb = $total_capacity / (1024*1024);

    return $total_capacity_in_mb;
}</pre>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/07/dynamically-setting-the-java-heap-size-at-runtime/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Want to improve your science? Get a dog.</title>
		<link>http://proteo.me.uk/2011/06/want-to-improve-your-science-get-a-dog/</link>
		<comments>http://proteo.me.uk/2011/06/want-to-improve-your-science-get-a-dog/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 10:41:12 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Computing]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=129</guid>
		<description><![CDATA[Actually the dog is somewhat irrelevant &#8211; it&#8217;s what comes with it which matters.  One of the side-effects of dog ownership is that you get to spend an hour or so a day out walking, which means you have an hour or so with your own thoughts and no distractions. I&#8217;m sure everyone has experienced [...]]]></description>
			<content:encoded><![CDATA[<p>Actually the dog is somewhat irrelevant &#8211; it&#8217;s what comes with it which matters.  One of the side-effects of dog ownership is that you get to spend an hour or so a day out walking, which means you have an hour or so with your own thoughts and no distractions.</p>
<p>I&#8217;m sure everyone has experienced the situation where you go to sleep thinking about a problem, and then wake up in the morning with the answer. It&#8217;s pretty common for your brain to replay the events of the day whilst you sleep, and it&#8217;s surprising how just making time to think about something can help you to clarify and understand problems you&#8217;re working on.  Doing this overnight is good as far as it goes, but can end up somewhat surreal as you effectively relinquish control of your train of thought.</p>
<p>The same principle applies during the day though. If your brain isn&#8217;t immediately occupied with a specific task it will tend to go back over other problems you&#8217;re working on at the moment.  The problem is that in a modern office or lab setting we hardly ever have time when we&#8217;re not being presented with something to do. Even when we&#8217;re not at work it feels really unnatural to not be doing anything.  There&#8217;s always something to read, watch or do, so your brain never gets a chance to drift back to the topics which could usefully employ it.  That&#8217;s where the dog comes in.</p>
<p>I actually first noticed this effect when I started cycling to work. My commute by bike takes me about an hour (which is why I don&#8217;t do it all that often!), but I found that when I cycled to work I would be hugely productive for the first couple of hours of the day and find really creative solutions for things I was working on.  I first put this down to having more energy from my early morning exercise, but I then found that if I put on a podcast whilst I cycled I was no more productive than if I&#8217;d driven &#8211; so it wasn&#8217;t the cycling, it was having an hour where my brain had nothing to do other than chew over the problems I was working on.</p>
<p>Dogs are even better in this respect in that you don&#8217;t really get a choice about whether you&#8217;re going to take them out.  You&#8217;re forced to spend some time every day free of external distractions, and you gain the attendant benefits.</p>
<p>Somewhat perversely I find that conferences work this way too, only they provide a double benefit.  Anyone who&#8217;s ever been to a conference will know that talks fall into two categories &#8211; those which capture your attention and provide new and interesting ways of interpreting your science, and those in which you switch off and stop listening in the first five minutes. Actually I&#8217;d argue that both of these types of talk are beneficial.  The first inspires you and gives you new information to process, and the second gives you some uninterrupted time to think about what you learned in the first. Many of the most interesting projects I&#8217;ve worked on have begun during poor talks in a conference, where I&#8217;ve stopped listening to the current talk and have sketched out the structure for a software package, or a theory which I could test.  Granted it doesn&#8217;t always work this way.  I once spent an hour in a particularly dull keynote writing a program which would rate all of the keynotes in a conference by the change in ping times on the wireless network (its conclusions matched surprisingly well with my own personal judgement), but if not productive at least that was creative.</p>
<p>My tip for improved science then is not necessarily to get a dog &#8211; although I can heartily recommend doing so &#8211; it&#8217;s to force yourself to make some time each day where you don&#8217;t have anything to think about.  Your brain will thank you for it.</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/06/want-to-improve-your-science-get-a-dog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Choosing the best format for raw sequence data</title>
		<link>http://proteo.me.uk/2011/06/choosing-the-best-format-for-raw-sequence-data/</link>
		<comments>http://proteo.me.uk/2011/06/choosing-the-best-format-for-raw-sequence-data/#comments</comments>
		<pubDate>Thu, 16 Jun 2011 09:59:03 +0000</pubDate>
		<dc:creator>simon</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Computing]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[fastqc]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://proteo.me.uk/?p=122</guid>
		<description><![CDATA[Introduction In the current Illumina pipeline raw sequence data is generated in qseq files, but can optionally be converted to the more standard FastQ format for use with other analysis programs.  The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system.  We&#8217;ve therefore been thinking [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>In the current Illumina pipeline raw sequence data is generated in qseq files, but can optionally be converted to the more standard FastQ format for use with other analysis programs.  The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system.  We&#8217;ve therefore been thinking about either compressing or converting these files to save on the amount of storage they require.</p>
<p>At the same time we&#8217;ve been expanding the range of compression schemes supported in <a href="http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/">FastQC</a>, which gives us a good impression of how quickly we can extract data from the different available formats.  Since I&#8217;ve collected some data on the storage and processing requirements for these formats, I thought I&#8217;d share it to help inform others who may be making similar decisions.</p>
<p>The first choice is simply to compress the fastq files with a standard compression scheme.  The two most commonly used are gzip and bzip2.  Many analysis programs support gzipped fastq files as input in addition to uncompressed files, and a few are starting to support bzip2.  Bzip2 is often chosen over gzip because it compresses data more efficiently.  The other choice is to put the raw data into a <a href="http://samtools.sourceforge.net/SAM1.pdf">BAM file</a>.  This format is specifically designed to hold high throughput sequence data and uses a compression scheme which is designed to be optimal for sequence data.  The BAM format was primarily designed to hold information about sequences which had been mapped to a reference sequence, but it also allows raw sequences with no associated mapping to be stored, at the cost of some overhead for the mapped position fields which are not used.</p>
<p>For the tests I used a fastq file containing around 500,000 reads of 33bp in length.  The processing times were taken as the time to process the file completely with FastQC.  Since the processing overhead for the QC analysis should be the same in all cases any differences will be attributable to the different amount of data needing to be read from disk, and the CPU time required for the decompression.  The tests were run on a MacBook Pro laptop (so not the fastest hard drive, or the speediest CPU).  FastQC uses pure java decompression code.  For gzip compression this is built into the JRE and for bzip2 I used the <a href="http://code.google.com/p/jbzip2/">jbzip2</a> library.</p>
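<p>A similar comparison can be sketched outside FastQC.  The following is a minimal Python sketch, not the FastQC code used for the timings in this post.  It builds a small synthetic FastQ file (the function names, read count and random data are all assumptions made for illustration), compresses it with the standard library gzip and bz2 modules, and reports the compressed size and the time taken to decompress each one.  Random reads compress less well than real data, so the absolute numbers will not match the results below.</p>

```python
import bz2
import gzip
import random
import time


def make_fastq(n_reads, read_len, seed=0):
    """Build a synthetic FastQ text (random bases and qualities) as bytes."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(2, 40)) for _ in range(read_len))
        records.append("@read%d\n%s\n+\n%s\n" % (i, seq, qual))
    return "".join(records).encode("ascii")


def compare_codecs(raw):
    """Return {codec name: (compressed size in bytes, seconds to decompress)}."""
    results = {}
    for name, module in (("gzip", gzip), ("bzip2", bz2)):
        packed = module.compress(raw)
        start = time.perf_counter()
        unpacked = module.decompress(packed)
        elapsed = time.perf_counter() - start
        # Make sure the data survived the round trip before trusting the timing.
        if unpacked != raw:
            raise ValueError("round trip failed for " + name)
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    raw = make_fastq(5000, 33)
    print("uncompressed: %d bytes" % len(raw))
    for name, (size, seconds) in compare_codecs(raw).items():
        print("%s: %d bytes, %.3f s to decompress" % (name, size, seconds))
```

<p>To try this on real data, replace the call to make_fastq with the bytes read from one of your own FastQ files.</p>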
<h2>Results</h2>
<table>
<tbody>
<tr>
<th>File type</th>
<th>File size</th>
<th>Time to process (seconds)</th>
</tr>
<tr>
<td>Uncompressed FastQ</td>
<td>69.8MB</td>
<td>14.1</td>
</tr>
<tr>
<td>Gzip Compressed FastQ</td>
<td>17.5MB</td>
<td>11.0</td>
</tr>
<tr>
<td>Bzip2 Compressed FastQ</td>
<td>13.9MB</td>
<td>72.1</td>
</tr>
<tr>
<td>BAM</td>
<td>16.3MB</td>
<td>11.5</td>
</tr>
</tbody>
</table>
<h2>Conclusions</h2>
<p>It is clear that converting your raw FastQ files to a more efficient storage format will produce significant gains in disk space usage.  Reducing your storage requirements by a factor of 4-5X and actually making your processing more efficient in some cases is a win-win proposal.</p>
<p>From the results presented it seems clear that unless disk usage is critically important, bzip2 compression is not a viable solution. Increasing your processing time by over 500% relative to gzip for a 20% reduction in size does not seem to be a good trade-off.  I&#8217;ve also checked using the command line gzip/bzip2 decompression utilities in case the effect I saw was an artefact of the Java implementations, but the size of the difference between the two was similar there as well.</p>
<p>Choosing between gzip and BAM is less clear.  It&#8217;s probably fair to say that at the moment more analysis programs support gzipped fastq files as input than support BAM files (which is more normally seen as an output format), but this may change in the future.  BAM files offer the prospect of adding in mapping data alongside your sequence data with a minimal increase in file size, which may be a benefit to some.  If you&#8217;d prefer to keep your raw data separate from derived data then gzipped FastQ files would seem to be the better choice.</p>
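<p>The figures quoted here follow directly from the results table.  As a quick check, this short Python sketch recomputes them from the table values; the variable and function names are chosen purely for illustration.</p>

```python
# Values copied from the results table: (file size in MB, seconds to process).
RESULTS = {
    "fastq": (69.8, 14.1),
    "gzip": (17.5, 11.0),
    "bzip2": (13.9, 72.1),
    "bam": (16.3, 11.5),
}


def size_factor(fmt):
    """How many times smaller a format is than the uncompressed FastQ file."""
    return RESULTS["fastq"][0] / RESULTS[fmt][0]


def percent_change(old, new):
    """Percentage change going from old to new (negative means a reduction)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for fmt in ("gzip", "bzip2", "bam"):
        print("%s is %.1fx smaller than uncompressed FastQ" % (fmt, size_factor(fmt)))
    gzip_size, gzip_time = RESULTS["gzip"]
    bzip2_size, bzip2_time = RESULTS["bzip2"]
    print("bzip2 vs gzip time: %+.1f%%" % percent_change(gzip_time, bzip2_time))
    print("bzip2 vs gzip size: %+.1f%%" % percent_change(gzip_size, bzip2_size))
```

<p>With the table values this gives size reductions of about 4.0x (gzip), 5.0x (bzip2) and 4.3x (BAM), and shows bzip2 taking roughly 555% longer than gzip to process for a file about 21% smaller.</p>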
<p>In our case we&#8217;re going to opt for simply gzipping our FastQ files since this seems to be a simple process which won&#8217;t affect any of our existing workflows or processing and which will return to us a significant amount of storage space.</p>
]]></content:encoded>
			<wfw:commentRss>http://proteo.me.uk/2011/06/choosing-the-best-format-for-raw-sequence-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

