Archive for the ‘Computing’ Category
The true cost of object creation in java
Tuesday, December 6th, 2011
I’ve been spending some time trying to optimise the data loading part of one of my java projects. The nature of the data we use means that we have to create hundreds of millions of objects, each of which internally stores only a single long value (it actually stores several fields packed into this value using a bitmask since this is more memory efficient).
When loading our data we are therefore parsing hundreds of millions of long values and creating the associated objects. This can take a few minutes to complete, and having profiled the code it seems that it is the object creation which slows everything down. I therefore did some tests to work out exactly how slow the creation of objects is relative to the primitives which exist in java. My test code is below:
public class CreateTest {
public static void main (String [] args) {
long start = System.currentTimeMillis();
long [] primitives = new long[50000000];
for (int i=0;i<primitives.length;i++) {
primitives[i] = i;
}
long end = System.currentTimeMillis();
System.out.println("Making 50 million longs took "+(end-start)+"ms");
start = System.currentTimeMillis();
Long [] objects = new Long[50000000];
for (int i=0;i<primitives.length;i++) {
objects[i] = new Long(i);
}
end = System.currentTimeMillis();
System.out.println("Making 50 million Longs took "+(end-start)+"ms");
}
}
It’s reasonable to think that there will be an overhead for object creation, but I was surprised by the results:
Making 50 million longs took 199ms
Making 50 million Longs took 10809ms
So that’s a 50-fold overhead for the object wrappers around these numbers. What’s worse is that this overhead seems to happen inside the JVM in such a way that you can’t take advantage of multi-threading to get around it. I tried refactoring the code to have 5 threads creating 10million reads each, and the total runtime across 5 cores was pretty much exactly the same as doing the same thing on a single core. This means that if you want to have 50 million objects available in your program then you’re just going to have to wait 10 seconds for them, however many cores you want to throw at the problem.
I also investigated other options for object creation. Namely I made my object cloneable and then used clone() to create new instances rather than calling the constructor. The constructors for my object are very lightweight, so it was disappointing, but not surprising to see that this had no appreciable effect on the time taken for object creation.
I’ve even toyed with the idea of just storing these objects as an array of longs and avoiding this overhead all together. I could still extract the relevant data by using a set of static methods, but what I can’t then do is to sort these objects (which I need to) since there’s no way to do a custom sort in java without putting the data into objects (which would defeat the point).
I’m therefore stuck with the biggest bottleneck in my program being something which I know is able to be improved by 50X (and would then make everything hugely quick), but not within the confines of the java language.
Tags: java, memory, object
Posted in Computing | Comments Off
Moving over to Casava 1.8
Friday, September 16th, 2011
Introduction
Illumina have recently released an updated version of their downstream analysis software CASAVA. This is the analysis pipeline which runs after the sequencer has processed the raw data down to base call files and provides a variety of functionalities from creating usable base calls to alignment and variant calling. Casava 1.8 makes some major changes to previous versions so this isn’t going to just be a drop in replacement for the workflow you used to have. You’re going to have to think about this one.
As we’ve just got through the processing of our first run with Casava 1.8 I wanted to write up the problems (and in some cases solutions) we’d found both so I can remind myself later on and hopefully help other people who might end up in a similar situation.
Our existing pipeline
Our usage of the Illumina pipeline is fairly straight forward. We use it to generate sequence with qualities and in most cases we use Eland to map to a reference genome. We haven’t done any variant calling through the pipeline and RNA-Seq mapping was done externally with tophat rather than with Illumina’s own tools. We have a LIMS system which manages our samples for us and we link into the run folders created by the sequencer to present data back to the user.
A long long time ago…
In older versions of the pipeline there was a strictly hierarchical set of programs which created a structured set of folders which matched the order in which the programs were run. Each folder was named after the program which created it and the time it was run. The order of execution was:
Images (tiff) > [firecrest] > Intensities (cif) > [Bustard] > Base Calls (qseq) > [Gerald] > Filtered Base Calls (fastq) and mapped data (sorted)
In more recent times much of this processing moved into SCS on the sequencer itself so the pipeline actually started at base calls, leaving only a small script to create qseq files and then run Gerald as normal. You could still run all or part of the full pipeline if you wanted to.
There were good and bad things about this pipeline. On the plus side you had full control over what was run, and it was easy to tell when looking through a run folder exactly what had been done simply by looking at the directory structure. Configuration was achieved mostly within a single gerald config file which said what type of analysis you wanted to run and provided details of the reference genome to use. This config file could be passed to any stage in the pipeline and would pass down the levels so that with a single command you could run the whole pipeline from images to mapped data and be confident it was all going to work.
On the bad side this pipeline got quite clunky if you had to demultiplex your samples. There were also issues with file types – illumina made their own variant of fastq files which used a different quality encoding to everyone else (two different variants actually), they also used a non-standard format for their alignments which required a non-trivial conversion to a more standard format for some downstream applications. Also the final results files were named solely after the lane in which they were sequenced, so you couldn’t tell from the file name alone with run a file had come from, which was a pain when mixing files from different runs in the same analysis. Finally the output files tended to be uncompressed text files, which as outputs got bigger (especially with the advent of the HiSeq) got very big and unwieldy.
Having said all of this, the system worked and had remained largely unchanged since the dawn of the GAII, albeit with different sections moving around between the pipeline server to the IPAR (remember those?), and the control PC.
A new hope…
Casava 1.8 was Illumina’s attempt to address some of the problems which had arisen in older versions. It wasn’t just a tweak on the existing pipeline but rather a major update which fundamentally changed the way in which the pipeline was run and the types and locations of the output files. The update had been planned for many months, with Illumina putting out a list of proposed changes for comment and responding to questions about these well before the release of the new pipeline. The major changes were to be:
- The new pipeline would be based around standard file formats (fastq with sanger encoding, and BAM for alignments)
- The pipeline would have much improved support for barcoded samples
- The pipeline would allow sample annotations to be used to identify samples and help with data management
Concerns were raised about some of these proposals, but reassuring noises were made by Illumina who assured everyone that people with existing workflows would be able to transition smoothly to the new system, which would also make new analyses easier to run.
The Empire Strikes Back…
So now that Casava 1.8 has been released how much does this new pipeline make the life of the bioinformatician easier or harder? Since we’ve been working on our first Casava run I’ll go through the changes we’ve had to make and the problems we’ve hit. It is not at all unlikely that there are simple work rounds for some of our problems, but I’ve spent a long time going through the documentation and if there is an easy way around this then I haven’t yet found it.
Config files
One of the major annoyances in the new version is the lack of consistency in the config files required to process a run. Up until now there was only one config file (gerald.conf) which laid out the type of analysis required, either globally or on a per-lane basis. With the new version there are now two separate config files required along with many more command line options.
In the new casava the gerald.conf file is still present (although now called config.txt) with substantially the same options as before, but this is now supplemented by a compulsory sample sheet csv file which must be provided when processing of the run starts. The sample sheet allows you to specify a flowcell id, sample name, description and associated project name along with any barcode sequences, a reference genome, sequencing recipe, operator and a flag to say if the sample is a control. You must provide a sample sheet to complete base calling and the sample names and project names are used to name the output files from base calling. Any listed barcodes are used to split the sequences into their respective subgroups.
There are a couple of problems here. Firstly there’s nothing in the documentation which says which fields in the sample sheet are required and which are optional. It appears that the only things you actually need are the lane, sample name and project name fields. You can specify a flowcell id but it’s not validated against the run you’re processing. You can specify a reference genome, but it’s not actually used when you do an alignment – you have to specify it again when you write your gerald.conf file. The other fields seem to only be there for your records and not because the pipeline uses them.
If, like us, you already have a LIMS system for managing your samples then having to write out a blank sample sheet I don’t need is an annoyance I could do without. Also, since our LIMS needs to automatically recognise the output files from a run folder this is substantially harder to do when there is no standard default blank sample sheet. It’s going to take a lot more parsing to find the files associated with a given lane than in the previous release although since either the prefix or suffix of the files involved is controlled this is still possible.
Next there is the problem of configuration. In the old system there was a single configuration file which you presented at the top of your analysis stack and which passed down to all levels. In the new system it’s much more of a mess.
- For base calling you need to provide a sample sheet.
- For alignment you need to provide a gerald conf file
- For BAM creation and variant calling you need to specify the reference genome on the command line and do this separately for every sample.
That’s 3 different places you can set the reference genome, only 2 of which are ever used! The reference genome now also needs to be a set of fasta files which need to be in files ending with .fa (mine ended in .txt but the docs don’t say that, you just work it out when you get an error when you try to use them).
Whereas the transition from one stage of the pipeline to the next used to be handled via a recursive make, the new system now works by letting you specify a command to run after the current analysis has successfully completed. This is in some ways more flexible as it allows you to insert your own custom scripts into the workflow, but it’s also a real pain to work out what set of commands you’re going to need to run to get to the very end of the pipeline, especially as these are going to reference folder names which don’t yet exist, and whose names are dynamically generated from the data in the sample sheet. Add to this that BAM file creation is now a per-sample process and this means that configuring a full run is now a major undertaking, and one which is very fragile since the pipeline itself cannot validate the options for anything beyond the initial analysis step. Unless you have a completely standard analysis then this is going to put an end to running stuff over the weekend and being sure that it will have finished successfully when you come back on Monday!
Output files
The move to use standard formats for output files should have made life easier for users of the pipeline, but in practice it’s actually harder to work with the new system than the old one.
One change which has been introduced, probably due to the number of sequences generated by hi-seq machines, is that you now get a set of output files rather than a single file for all of the analyses. This was sometimes the case for internal formats in previous releases (eg qseq files), but not it applies to some of the final outputs. The sets of files have a common start and end with a grouping number which let you know how to recombine them. So, for a GAIIx lane you might expect to get up to 8 gzipped fastq files for each lane. You can pass an option to increase the number of sequences in each file , but if you make that limit over 16million then it breaks eland, so you’re really stuck with multiple output files. This same problem affects eland output as well with multiple gzipped export files.
Another problem with the fastq output is that unlike older versions of the pipeline the fastq files in Casava 1.8 are not filtered, so you’ll have an entry for every cluster on your flowcell. A flag in the header tells you if your sequence should be filtered, but none of the downstream programs will understand that, so you’re going to have to do it yourself, which takes away the point of having fastq files assembled for you already!
Illumina do acknowledge both of these problems in the documentation, and provide documented workrounds. Unfortunately both of the workrounds they provide are broken – they say that you can simply cat the gzipped fastq files together to combine them, but unfortunately this puts gzip headers in the middle of your file. Some decompressors (gzip itself for example) can understand this and will extract all of the data, but others (the standard java gzip extractor for example) do not, and just stop extracting data (with no warnings or errors) after the end of the first compressed section. There is also a somewhat complex grep statement provided which should filter out flagged sequences to produce a fastq file, but this was reported to create broken fastq files from some inputs, and leaves you with uncompressed data.
We ended up writing our own short Perl script which filtered, combined and re-compressed the fastq files but this (well the compressing part) is a slow process which we’re currently performing manually.
The other standard output file used in the new version is BAM files for aligned output. What wasn’t said though is that this is not the default output from eland (which is still export files, albeit now gzipped). Creating BAM output requires that you run part of the variant calling pipeline to run the sort and bam targets to turn your export files into a bam. There is a stand alone script to to this without going through variant calling, but this only produces SAM files. There is also a –bam option for Eland, but you can only set this (according to the docs at any rate) if you run Eland as a stand alone program, and not by running the actual pipeline. The documentation for this part of the pipeline is tortuous with at least 3 different way to create BAM files all with their own options and requirements.
When you do finally create a BAM you find that in order to do this you must create an arbitrarily named directory (so no chance of finding it with our LIMS), which contains a completely generically named bam file (sorted.bam), so really no way to associate this robustly with a sample, along with separate bam files for every chromosome in your reference genome, even if you’re not doing any other analysis. All to get the mapped data in an output you can use with any other downstream application.
Finally, the base calls, alignments and variant calls are now not stored in a structured heirarchy under the data folder, but get their own top level folders inside the run folder (base calls go in ‘Unaligned’, export files go in ‘Aligned’ and BAM files go wherever you tell them, with no defaults). What this means in effect is that you can only run one set of base calls and alignments if you need to be able to reliably predict where your output will go. Subsequent runs will fail if the output folder already exists. I realise many people won’t ever both reprocessing their data, but we do this all the time and this change is going to limit what we can do. We can work around it locally, but we also distribute a LIMS which has to find these files on other sites and this gets to be impractical in the new system (and impossible in the case of variant calling output).
The bioinformatician strikes back…
So what is the outcome of all of this?
In our case the outcome is that we’re going to:
- Use a standard default sample sheet for all runs so we at least get consistent directory naming
- Rewrite our LIMS to be able to recognise the new folder structure generated from our standard sample sheet
- Write a script to filter and combine the multiple fastq files into a single file we can give back to our users
- Stop using Eland for our mapping. It’s just too much of a pain to put this into the pipeline and link the results with our LIMS. We were using bowtie and tophat for secondary mappings anyway but these will take over as our primary mapping pipeline.
- Work out a nasty hack to allow our LIMS to recognise reprocessed folders – which won’t work on any site other than ours.
All in all, for our use case this update has made our pipeline significantly less functional than it used to be. I can see that there are probably people who would benefit from some of the changes here, but they are far from being the promised panacea. I’d be interested to know how many groups are actually doing any downstream analysis in Casava, as opposed to stopping after base calling and using other options. I got used to getting funny looks when I said we were using Eland, and I suspect these changes will make the users of Casava for analysis an ever rarer breed.
With luck some or all of the problems we’ve experienced will be rectified in future releases. But I think once we’ve moved away from doing anything beyond base calling in Casava then I don’t see us moving back.
Addenda…
Since posting this originally I’ve received some feedback which might be useful.
It appears that you can run Casava without a sample sheet at all, even though the documentation makes no mention of this, and actually says it will read one from a default location if one isn’t specified. The output directories are named after the flowcell id for the the project name, and the lane number for the samples – which is way better than using a minimal default sample sheet.
In a more recent update Illumina say they are intending to release an update to Casava (1.8.1) which will change the behaviour of the BCL converter to not include reads which failed QC by default. They will also provide an optional flag to write all of the sequence output into a single file. This should solve our problems with this part of the pipeline.
Tags: casava, illumina, sequencing
Posted in Bioinformatics, Computing | Comments Off
Getting the java heap size you asked for
Friday, August 26th, 2011
In a recent post I discussed a method we’re using for automatically setting the java heap size appropriately at runtime. It now turns out that the issue of setting the heap size is complicated by the fact that the heap size you request on the command line isn’t necessarily what you get given. In some cases the differences are modest, but sometimes they can be significant – amounting to hundreds of megabytes of discrepancy.
The simple test I did was to compare the heap size requested by setting the -Xmx value on the java command with the actual amount of available memory as reported by Runtime.getRuntime().maxMemory(). What I found was that the relationship between these two values isn’t 1:1, isn’t fixed at a given ratio, and is platform (and indeed VM) dependent.
According to this bug report the actual implementation of -Xmx is VM-dependent, so that the value you supply on the command line is merely a suggestion to the VM and it’s free to do whatever it likes. Because I’d like my software to work consistently on all platforms I therefore had a look at what the different VMs actually do.
The OSX VM actually stays very close to the requested amount of memory across the whole range of requested heap sizes. The linux and windows VMs though overcommit at small heap sizes (there seems to be a minimum allowed heap size of ~10MB), but undercommit by up to 12% at larger heap sizes. When you’re requesting a heap of serveral gigabytes in size a 12% loss is a significant amount of memory.
Our immediate solution to this problem is to do a trial run where we launch a small program which reports the actual heap size allocated. We then relaunch the normal java command, increasing the heap request size by a correction factor calculated from the trail run. This seems to produce consistent results on all platforms and gives us what we asked for in the first place.
Mac application bundle caching
Wednesday, August 10th, 2011
Having spent a frustrating hour or so trying to update a mac application bundle I thought I’d share a couple of things which caused no end of confusion and aren’t what you’d expect and are therefore likely to catch out those working with application bundles for the first time.
Basically I was finding that although I was making changes to my bundle these changes weren’t actually being applied. The bundle seemed to hold on to older versions of the data in the bundle even if this had been changed or even deleted.
It turns out that OSX caches information from the application bundle and then doesn’t check to see if the information in the bundle has been updated unless it is forced to. There are two separate caches which can affect you, a cache of the bundle icons, and a cache of the Info.plist file.
Icon cache
Inside your application bundle you can create a file containing a series of icons for your application. You make this file from a series of individual bitmap files normally by using the icon composer application which comes with the developer tools in OSX. What you will find is that if you update your icns file containing your application icons you won’t actually see any difference in the icons which are shown in the finder. The reason for this is that OSX caches the icons associated with files in a folder in a hidden file called .DS_Store. If this file is present then the finder will read the icon data out of that even if the icns file in your application bundle has been updated.
If you want to force the finder to recalculate the DS_Store file then you need to do the following:
- Make sure you have closed all finder windows showing your folder of interest
- From within the terminal delete the .DS_Store file from your folder of interest
- Relaunch the finder using Apple > Force Quit > Finder > Relaunch (this won’t affect any of your running applications)
This will force the finder to recalculate the icon cache for your folder and any changes to your icns file will be visible.
Info.plist cache
The Info.plist file in your application bundle provides all of the basic information about how to launch your application, where to find the icon file and various settings relating to the applications name and version. As with the application icons the information in the Info.plist file is cached by the OS, such that if you change the data in the file (say you change the name of the executable you want to run), then the bundle will remember the old value not the new one.
As an aside, if you’re having problems running an application bundle, the easiest way to see what’s actually going wrong is to open the Applications > Utilities > Console application and then relaunch your application. Any output generated on stdout or stderr by an application bundle will show up in this application and you can see any error messages which are generated.
Clearing the Info.plist cache is a bit easier than clearing the icon cache. The quickest way to reset it is to simply move your application to a different folder (say your desktop), and then move it back. This will reset the Info.plist cache and your changes should be applied.
Dynamically setting the java heap size at runtime
Friday, July 29th, 2011
One of the oddities about java programs is that they require you to set a maximum heap size when you start the program. What this means in effect is that you need to be able to predict the memory usage of your program before it starts, and whatever heap size you set needs to be appropriate for all of the system the program is going to run on and all of the datasets it will handle.
When you’re distributing a desktop application which needs to process tens of gigabytes of data this can be a problem. Ideally you’d like to set a heap size of a few gigabytes to give yourself enough overhead to process even the largest of datasets. However not all machines have that much RAM installed, and even if they do they require a 64 bit OS to be able to use more than 2GB of it on a single process.
Up until now we’ve resorted to setting a lowest common denominator heap size (1500M), and providing instructions for increasing this on systems which can handle it. This is however very inelegant and means we have to warn users if they’re running out of memory and make them save, reconfigure and restart the application.
We have now moved over to using a system which dynamically sets the heap size to an appropriate value at runtime. We do this by writing a Perl wrapper which launches the java application after having worked out the most appropriate heap size for the current system.
To do this we have to work out:
- Whether we have a 32 or 64 bit JRE to work with
- The amount of physical RAM in the machine
The heap size is set as 2/3 of the physical RAM which leaves enough overhead for the JRE and basic system processes. In addition we set a ceiling on the heap size. For 32 bit systems this is 1500M which is the most you can practically use given the 2GB per process limit (you have to leave something for the JRE itself). For 64 bit systems we set the ceiling at 6GB. It’s not in our interest to set the heap size too high as this ends up resulting in long freezes during garbage collection, so we set it to the largest size we’re practically going to need. We work out if we’re on a 64-bit system by parsing the output of java -version (it doesn’t matter if the OS is 64 bit if the JRE is still 32 bit).
Finding the amount of phyiscal RAM is a platform specific task. On windows we have to use the Win32 API. Under linux we parse the output of ‘free’, and on OSX and the BSDs we parse the output of top.
If the user doesn’t like our auto-configured heap size we allow them to override it by passing a -m argument to the wrapper script.
For unix-like OSs we don’t need to worry about perl being present, but for windows we compile the script into a windows binary using pp. The Win32::API module is bundled into this binary. On other platforms we don’t need to distribute this since it’s loaded dynamically at runtime only if perl’s $^O variable tells us we’re running under windows. Under OSX we can run the wrapper nicely from the command line. We’re still working out the best way to include this as part of an application bundle.
This isn’t perhaps the cleanest of solutions, but compared to the very manual process we had before it’s a lot easier for the users to get their systems set up optimally.
#!/usr/bin/perl use warnings; use strict; use English; use FindBin qw($RealBin); use Getopt::Long; use IPC::Open3; my @java_args;
# See if they manually set a heap size
my $memory;
my $result = GetOptions(
'memory=i' => \$memory,
);
if ($memory) {
if ($memory < 500) {
die "Memory allocation must be at least 500M";
}
}
else {
$memory = determine_optimal_memory();
}
unshift @java_args,"-Xmx${memory}m";
exec "java",@java_args, "uk.ac.bbsrc.babraham.SeqMonk.SeqMonkApplication";
sub print_error {
# We wrap errors like this so we can keep a windows shell window open
# so the user can see any errors we generate
my ($error) = @_;
warn $error;
$_ = <STDIN>;
exit 1;
}
sub determine_optimal_memory {
# We'll set a ceiling for the memory allocation. On a 32-bit OS this is going
# to be 1500m (the max it can safely handle), on a 64-bit OS we won't take more
# than 6GB
my $max_memory = 1500;
# We need not only a 64 bit OS but 64 bit java as well. It's easiest to just test
# java since the OS support must be there if you have a 64 bit JRE.
my ($in,$out);
open3(\*IN,\*OUT,\*OUT,"java -version") or print_error("Can't find java");
close IN;
while (<OUT>) {
if (/64-Bit/) {
$max_memory = 6000;
}
}
close OUT;
warn "Memory ceiling is $max_memory\n";
# The way we determine the amount of physical memory is OS dependent.
my $os = $^O;
my $physical;
if ($os =~ /Win/) {
$physical = get_windows_memory($max_memory);
}
elsif ($os =~/darwin/ or $os =~ /bsd/i) {
$physical = get_osx_memory($max_memory);
}
else {
$physical = get_linux_memory($max_memory);
}
warn "Raw physical memory is $physical\n";
# We then set the memory to be the minimum of 2/3 of the physical
# memory or the ceiling, whichever is lower.
$physical = int(($physical/3)*2);
if ($max_memory < $physical) {
return $max_memory;
}
warn "Using $physical MB of RAM to launch seqmonk\n";
return $physical;
}
sub get_linux_memory {
# We get the amount of physical memory on linux by parsing the output of free
open (MEM,"free -m |") or print_error("Can't launch free on linux: $!");
while (<MEM>) {
if (/^Mem:\s+(\d+)/) {
return $1;
}
}
close MEM;
print_error("Couldn't parse physical memory from the output of free");
}
sub get_osx_memory {
# We get the amount of physical memory on OSX by parsing the output of top
open (MEM,"top -l 1 -n 0 |") or print_error("Can't get amount of memory on OSX: $!");
my $total_mem = 0;
while (<MEM>) {
if (/^PhysMem:.*?(\d+)M\s+used,\s+(\d+)M\s+free/) {
$total_mem += $1;
$total_mem += $2;
}
}
close MEM;
unless ($total_mem) {
print_error("Could't parse physical memory from the output of top");
}
return $total_mem;
}
sub get_windows_memory {
warn "Getting windows physical memory\n";
# This code was adapted from an answer posted by Tom Feiner on
# stackoverflow
#
# http://stackoverflow.com/questions/423797/how-can-i-find-the-exact-amount-of-physical-memory-on-windows-x86-32bit-using-per
my ($max_memory) = @_;
eval {
require Win32::OLE;
Win32::OLE->import qw( EVENTS HRESULT in );
1;
} or do {
print_error("Couldn't load Win32 module to determine windows memory");
};
my $WMI = Win32::OLE->GetObject( "winmgmts:{impersonationLevel=impersonate,(security)}//./" ) || print_error ("Could not get Win32 object: $OS_ERROR");
my $total_capacity = 0;
foreach my $object (in($WMI->InstancesOf( 'Win32_PhysicalMemory' ))) {
$total_capacity += $object->{Capacity};
}
my $total_capacity_in_mb = $total_capacity / (1024*1024);
return $total_capacity_in_mb;
}
Tags: java, software
Posted in Computing | 1 Comment »
Want to improve your science? Get a dog.
Sunday, June 19th, 2011
Actually the dog is somewhat irrelevant – it’s what comes with it which matters. One of the side-effects of dog ownership is that you get to spend an hour or so a day out walking, which means you have an hour or so with your own thoughts and no distractions.
I’m sure everyone has experienced the situation where you go to sleep thinking about a problem, and then wake up in the morning with the answer. It’s pretty common for your brain to replay the events of the day whilst you sleep and it’s surprising how just making time to think about something can can help you to clarify and understand problems you’re working on. Doing this overnight is good as far as it goes, but can end up somewhat surreal as you effectively relinquish control of your train of thought.
The same principle applies during the day though. If your brain isn’t immediately occupied with a specific task it will tend to go back over other problems you’re working on at the moment. The problem is that in a modern office or lab setting we hardly ever have time when we’re not being presented with something to do. Even when we’re not at work it feels really unnatural to not be doing anything. There’s always something to read, watch or do so your brain never gets chance to drift back to the topics which could usefully employ it. That’s where the dog comes in.
I actually first noticed this effect when I started cycling to work. My commute by bike takes me about an hour (which is why I don’t do it all that often!), but I found that when I cycled to work I would be hugely productive for the first couple of hours of the day and find really creative solutions for things I was working on. I first put this down to having more energy from my early morning exercise, but I then found that if I put on a podcast whilst I cycled I was no more productive than if I’d driven – so it wasn’t the cycling, it was having an hour where my brain had nothing to do other than chew over the problems I was working on.
Dogs are even better in this respect in that you don’t really get a choice about whether you’re going to take them out. You’re forced to spend some time every day with no external distractions from which you can gain the attendant benefits.
Somewhat perversely I find that conferences work this way too, only they provide a double benefit. Anyone who’s ever been to a conference will know that talks fall into two categories – those which capture your attention and provide new and interesting ways of interpreting your science, and those in which you switch off and stop listening in the first five minutes. Actually I’d argue that both of these types of talk are beneficial. The first inspires you and gives you new information to process, and the second gives you some uniterrupted time to think about what you learned in the first. Many of the most intersting projects I’ve worked on have begun during poor talks in a conference, where I’ve stopped listening to the current talk and have sketched out the structure for a software package, or a theory which I could test. Granted it doesn’t always work this way. I once spent an hour in a particularly dull keynote writing a program which would rate all of the keynotes in a conference by the change in ping times on the wireless network (it’s conclusions matched surprisingly well with my own personal judegment), but if not productive at least that was creative.
My tip for improved science then is not necessarily to get a dog – although I can heartily recommend doing so – it’s to force yourself to make some time each day where you don’t have anything to think about. Your brain will thank you for it.
Posted in Bioinformatics, Computing | Comments Off
Choosing the best format for raw sequence data
Thursday, June 16th, 2011
Introduction
In the current Illumina pipeline raw sequence data is generated in qseq files, but can optionally be converted to the more standard FastQ format for use with other analysis programs. The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system. We’ve therefore been thinking about either compressing or converting these files to save on the amount of storage they require.
At the same time we’ve also been expanding the range of compression schemes supported in FastQC which gives us a good impression of how quickly we can extract data from the different available formats and since I’ve collected some data on the storage and processing requirements for the different formats available I thought I’d share these to help inform others who may be making similar decisions.
The choices we have are to simply compress the fastq files with a standard compression scheme. The two most commonly used are gzip and bzip2. Many analysis programs support gzipped fastq files as input in addition to uncompressed files, and a few are starting to support bzip2. Bzip2 is often chosen over gzip because it compresses data more efficiently. The other choice is to put the raw data into a BAM file. This format is specifically designed to hold high throughput sequence data and uses a compression scheme which is designed to be optimal for sequence data. The BAM format was primarily designed to hold information about sequences which had been mapped to a reference sequence, but it also allows for raw sequences with no associated mapping to be stored, but with some overhead for the mapped position fields which are not used.
For the tests I used a fastq file containing around 500,000 reads of 33bp in length. The processing times were taken as the time to process the file completely with FastQC. Since the processing overhead for the QC analysis should be the same in all cases any differences will be attributable to the different amount of data needing to be read from disk, and the CPU time required for the decompression. The tests were run on a MacBook Pro laptop (so not the fastest hard drive, or the speediest CPU). FastQC uses pure java decompression code. For gzip compression this is built into the JRE and for bzip2 I used the jbzip2 library.
Results
| File type | File size | Time to process (seconds) |
|---|---|---|
| Uncompressed FastQ | 69.8MB | 14.1 |
| Gzip Compressed FastQ | 17.5MB | 11.0 |
| Bzip2 Compressed FastQ | 13.9MB | 72.1 |
| BAM | 16.3MB | 11.5 |
Conclusions
It is clear that converting your raw FastQ files to a more efficient storage format will produce significant gains in disk space usage. Reducing your storage requirements by a factor of 4-5X and actually making your processing more efficient in some cases is a win-win proposal.
From the results presented it seems clear that unless disk usage is critically important then bzip2 compression is not a viable solution. Increasing your processing time by over 500% for a 20% reduction in size does not seem to be a good trade off. I’ve also checked using the command line gzip/bzip2 decompression utilities in case the effect I saw was an artefact of the java implementations, but the size of the difference between the two was similar there as well.
Choosing between gzip and BAM is less clear. It’s probably fair to say that at the moment more analysis programs support gzipped fastq files as input than support BAM files (which is more normally seen as an output format), but this may change in the future. BAM files offer the prospect of adding in mapping data alongside your sequence data with a minimised increase in filesize which may be a benefit to some. If you’d prefer to keep your raw data separate from derived data then gzipped FastQ files would seem to be the better choice.
In our case we’re going to opt for simply gzipping our FastQ files since this seems to be a simple process which won’t affect any of our existing workflows or processing and which will return to us a significant amount of storage space.
Tags: compression, fastqc, java
Posted in Bioinformatics, Computing | Comments Off
Selecting large random integers in Perl
Monday, April 11th, 2011
We had a very odd bug in a simulation we were writing recently. We were supposed to be sampling from a large pool of possible data, but were getting a very weird distribution of values. After much debugging we found a most unusual cause.
Here is the pop quiz – read through the short script below and take your best guess at what the output will be. Being correct within 5% is good enough.
#!/usr/bin/perl use warnings; use strict;
my %seen;
for (1..10000000) {
my $rand = int(rand(1000000));
++$seen{$rand};
}
print "I saw ".(scalar keys %seen)." different values\n";
I should point out that the random number generation here is done according to the perl documentation, which simply says:
“Apply “int()” to the value returned by “rand()” if you want random integers instead of random fractional numbers. For example,
int(rand(10))
returns a random integer between 0 and 9, inclusive.”
OK, have you guessed – well the answer we got was:
I saw 32768 different values
Yes, that’s right, after selecting 10 million values from a range of 0-999,999 we only actually saw just over 32 thousand different values. This was the reason our distribution looked odd – we were only seeing around 2% of the values we could have seen.
It turns out that the cause of this oddity is platform specific. Perl doesn’t itself include code to generate random numbers – it simply makes a call to the random number library supplied by the underlying operating system. In our case this code was being run on 64-bit Activeperl under Windows 7, and the standard windows random number library is only capable of generating 32768 different values (15 bits of randomness).
If we take this exact code and run it under Linux we get:
I saw 999950 different values
..and our simulation returns sensible numbers.
This appears to be pretty poor – I’m sure the perl people will just blame the microsoft library, but this could be worked around in the perl implementation, or at the very least a note should be added to the rand() documentation to specifically warn that there is a precision limit to rand, and that this might be very low on some platforms.
Fortunately however there are some proper work rounds for this problem. If you need to reliably generate random numbers from a large range in Perl then there are a few modules which provide more fully featured random number generators than the default rand() function. Two of the most popular are Math::Random and Math::Random::MT, either of which will work reliably and consistently on all platforms.
Tags: integer, Perl, random
Posted in Computing | Comments Off
Recovered from defacing
Sunday, April 10th, 2011
Anyone visiting the site in the last couple of days may have found that I appeared to have simultaneously developed a fanatical interest in middle eastern politics, and a very poor taste in music. Whilst it’s very convenient to have a simple to use blogging engine such as WordPress to use, I guess the downside is that you occasionally make yourself the target of automated attacks.
It’s not completely clear how the site was compromised – there’s only one admin account on the site with a pretty unusual password (now changed) which wasn’t recycled elsewhere so that seems an unlikely avenue of attack. Some kind of compromise within WordPress seems more likely. The site was running the most recent available update of WordPress available from my web host, but there was an additional point update which was available but not applied. I guess I need to learn my lesson and make sure it’s always running the latest release, even if that means doing the update myself.
Fortunately I had a full backup of the site to fall back on (whoever defaced the site also kindly deleted all of the existing content) so I was able to put everything back. I did however find that even if I didn’t have a backup that I could recover every post I’d ever written from google’s cache, so it’s very generous of them to provide such an efficient backup service for the whole internet.
Hopefully now that I’m completely up to date with my wordpress version there shouldn’t be any more problems on the site. Apologies for any inconvenience caused whilst the site was a mess.
Tags: backup, defaced, hack, recovered, wordpress
Posted in Computing | Comments Off
How good is ‘good enough’ for research software
Sunday, September 12th, 2010
There are two linked problems which seem to face me with every piece of software I write for research use:
- When is the software complete enough to write a paper on it
- How to manage the versions and project description
I think that although similar questions arise within software written for general use, their answers are quite different for research software.
In a general software project it’s not unreasonable to set out with an idea of what you want the software to do and to have a list of features which, once complete, would mark the project as ready for a stable release. Following the standard conventions then works reasonably well:
- Alpha release – some features not implemented, any feature could be subject to change
- Beta release – all (or nearly all) features implemented but bugs may still exist
- Stable release – all features should be complete and code should be well tested to eliminate major bugs
On our projects page we used to put our software into these categories, but soon found that they really didn’t work well for research based software. Most crucially, in normal software development progressing from alpha – beta – stable indicates an increase in quality, but in research software it peaks at beta, and stable is probably a bad sign. For research software we now use a system more like the following:
- Alpha – Software is still experimental. It’s not been widely tested so treat all derived results with a healthy dose of scepticism and double check everything. File formats and major concepts could still be subject to radical change.
- Beta – The way the program works is mostly stable and the released functionality has been pretty well tested and results should mostly be reliable. Specific details are still subject to change and more functionality could be added at any point, but saved files should still continue to work.
- Legacy – Generally used for projects which aren’t being worked on any more. If the functionality of the last release does what you want, then great, but no new functionality will be added, although obvious bugs will be fixed.
The most successful and active project are therefore under a pattern of continuous rolling development internally to adapt to the ongoing requirements of the research. Public releases tend to happen either when a reasonably serious bug is found, or when a bunch of new features are complete and everything in the menus actually works.
The problem with this kind of development is that no project is ever ‘finished’, which means there’s no obvious point at which to pause to say ‘It’s time to write this up as a paper’. There’s always one more thing you want to get finished before publishing. Since our group is a core facility rather than a research group we’re not really under pressure to produce our own papers, since our focus is to provide a service – and this makes the problem worse. We’ve had a few projects now where we kept deferring writing a paper to the point where the project moved into the ‘legacy’ phase and it was too late.
My new resolution therefore is to be less cautious about this. As soon as a project is producing scientifically useful results and is a state of completion where I’d be happy for other people to use it without putting warning stickers all over it we should aim to get a paper out on it. This will both help to promote work that we’re doing and get more people involved, but may also help to focus our attention on exactly which features we want to complete before committing to the publication.
Tags: research, software
Posted in Bioinformatics, Computing | Comments Off
