The true cost of object creation in java
December 6th, 2011
I’ve been spending some time trying to optimise the data loading part of one of my java projects. The nature of the data we use means that we have to create hundreds of millions of objects, each of which internally stores only a single long value (it actually stores several fields packed into this value using a bitmask since this is more memory efficient).
When loading our data we are therefore parsing hundreds of millions of long values and creating the associated objects. This can take a few minutes to complete, and having profiled the code it seems that it is the object creation which slows everything down. I therefore did some tests to work out exactly how slow the creation of objects is relative to the primitives which exist in java. My test code is below:
public class CreateTest {
public static void main (String [] args) {
long start = System.currentTimeMillis();
long [] primitives = new long[50000000];
for (int i=0;i<primitives.length;i++) {
primitives[i] = i;
}
long end = System.currentTimeMillis();
System.out.println("Making 50 million longs took "+(end-start)+"ms");
start = System.currentTimeMillis();
Long [] objects = new Long[50000000];
for (int i=0;i<primitives.length;i++) {
objects[i] = new Long(i);
}
end = System.currentTimeMillis();
System.out.println("Making 50 million Longs took "+(end-start)+"ms");
}
}
It’s reasonable to think that there will be an overhead for object creation, but I was surprised by the results:
Making 50 million longs took 199ms
Making 50 million Longs took 10809ms
So that’s a 50-fold overhead for the object wrappers around these numbers. What’s worse is that this overhead seems to happen inside the JVM in such a way that you can’t take advantage of multi-threading to get around it. I tried refactoring the code to have 5 threads creating 10million reads each, and the total runtime across 5 cores was pretty much exactly the same as doing the same thing on a single core. This means that if you want to have 50 million objects available in your program then you’re just going to have to wait 10 seconds for them, however many cores you want to throw at the problem.
I also investigated other options for object creation. Namely I made my object cloneable and then used clone() to create new instances rather than calling the constructor. The constructors for my object are very lightweight, so it was disappointing, but not surprising to see that this had no appreciable effect on the time taken for object creation.
I’ve even toyed with the idea of just storing these objects as an array of longs and avoiding this overhead all together. I could still extract the relevant data by using a set of static methods, but what I can’t then do is to sort these objects (which I need to) since there’s no way to do a custom sort in java without putting the data into objects (which would defeat the point).
I’m therefore stuck with the biggest bottleneck in my program being something which I know is able to be improved by 50X (and would then make everything hugely quick), but not within the confines of the java language.
Tags: java, memory, object
Posted in
Computing |
No Comments »
Moving over to Casava 1.8
September 16th, 2011
Introduction
Illumina have recently released an updated version of their downstream analysis software CASAVA. This is the analysis pipeline which runs after the sequencer has processed the raw data down to base call files and provides a variety of functionalities from creating usable base calls to alignment and variant calling. Casava 1.8 makes some major changes to previous versions so this isn’t going to just be a drop in replacement for the workflow you used to have. You’re going to have to think about this one.
As we’ve just got through the processing of our first run with Casava 1.8 I wanted to write up the problems (and in some cases solutions) we’d found both so I can remind myself later on and hopefully help other people who might end up in a similar situation.
Our existing pipeline
Our usage of the Illumina pipeline is fairly straight forward. We use it to generate sequence with qualities and in most cases we use Eland to map to a reference genome. We haven’t done any variant calling through the pipeline and RNA-Seq mapping was done externally with tophat rather than with Illumina’s own tools. We have a LIMS system which manages our samples for us and we link into the run folders created by the sequencer to present data back to the user.
A long long time ago…
In older versions of the pipeline there was a strictly hierarchical set of programs which created a structured set of folders which matched the order in which the programs were run. Each folder was named after the program which created it and the time it was run. The order of execution was:
Images (tiff) > [firecrest] > Intensities (cif) > [Bustard] > Base Calls (qseq) > [Gerald] > Filtered Base Calls (fastq) and mapped data (sorted)
In more recent times much of this processing moved into SCS on the sequencer itself so the pipeline actually started at base calls, leaving only a small script to create qseq files and then run Gerald as normal. You could still run all or part of the full pipeline if you wanted to.
There were good and bad things about this pipeline. On the plus side you had full control over what was run, and it was easy to tell when looking through a run folder exactly what had been done simply by looking at the directory structure. Configuration was achieved mostly within a single gerald config file which said what type of analysis you wanted to run and provided details of the reference genome to use. This config file could be passed to any stage in the pipeline and would pass down the levels so that with a single command you could run the whole pipeline from images to mapped data and be confident it was all going to work.
On the bad side this pipeline got quite clunky if you had to demultiplex your samples. There were also issues with file types – illumina made their own variant of fastq files which used a different quality encoding to everyone else (two different variants actually), they also used a non-standard format for their alignments which required a non-trivial conversion to a more standard format for some downstream applications. Also the final results files were named solely after the lane in which they were sequenced, so you couldn’t tell from the file name alone with run a file had come from, which was a pain when mixing files from different runs in the same analysis. Finally the output files tended to be uncompressed text files, which as outputs got bigger (especially with the advent of the HiSeq) got very big and unwieldy.
Having said all of this, the system worked and had remained largely unchanged since the dawn of the GAII, albeit with different sections moving around between the pipeline server to the IPAR (remember those?), and the control PC.
A new hope…
Casava 1.8 was Illumina’s attempt to address some of the problems which had arisen in older versions. It wasn’t just a tweak on the existing pipeline but rather a major update which fundamentally changed the way in which the pipeline was run and the types and locations of the output files. The update had been planned for many months, with Illumina putting out a list of proposed changes for comment and responding to questions about these well before the release of the new pipeline. The major changes were to be:
- The new pipeline would be based around standard file formats (fastq with sanger encoding, and BAM for alignments)
- The pipeline would have much improved support for barcoded samples
- The pipeline would allow sample annotations to be used to identify samples and help with data management
Concerns were raised about some of these proposals, but reassuring noises were made by Illumina who assured everyone that people with existing workflows would be able to transition smoothly to the new system, which would also make new analyses easier to run.
The Empire Strikes Back…
So now that Casava 1.8 has been released how much does this new pipeline make the life of the bioinformatician easier or harder? Since we’ve been working on our first Casava run I’ll go through the changes we’ve had to make and the problems we’ve hit. It is not at all unlikely that there are simple work rounds for some of our problems, but I’ve spent a long time going through the documentation and if there is an easy way around this then I haven’t yet found it.
Config files
One of the major annoyances in the new version is the lack of consistency in the config files required to process a run. Up until now there was only one config file (gerald.conf) which laid out the type of analysis required, either globally or on a per-lane basis. With the new version there are now two separate config files required along with many more command line options.
In the new casava the gerald.conf file is still present (although now called config.txt) with substantially the same options as before, but this is now supplemented by a compulsory sample sheet csv file which must be provided when processing of the run starts. The sample sheet allows you to specify a flowcell id, sample name, description and associated project name along with any barcode sequences, a reference genome, sequencing recipe, operator and a flag to say if the sample is a control. You must provide a sample sheet to complete base calling and the sample names and project names are used to name the output files from base calling. Any listed barcodes are used to split the sequences into their respective subgroups.
There are a couple of problems here. Firstly there’s nothing in the documentation which says which fields in the sample sheet are required and which are optional. It appears that the only things you actually need are the lane, sample name and project name fields. You can specify a flowcell id but it’s not validated against the run you’re processing. You can specify a reference genome, but it’s not actually used when you do an alignment – you have to specify it again when you write your gerald.conf file. The other fields seem to only be there for your records and not because the pipeline uses them.
If, like us, you already have a LIMS system for managing your samples then having to write out a blank sample sheet I don’t need is an annoyance I could do without. Also, since our LIMS needs to automatically recognise the output files from a run folder this is substantially harder to do when there is no standard default blank sample sheet. It’s going to take a lot more parsing to find the files associated with a given lane than in the previous release although since either the prefix or suffix of the files involved is controlled this is still possible.
Next there is the problem of configuration. In the old system there was a single configuration file which you presented at the top of your analysis stack and which passed down to all levels. In the new system it’s much more of a mess.
- For base calling you need to provide a sample sheet.
- For alignment you need to provide a gerald conf file
- For BAM creation and variant calling you need to specify the reference genome on the command line and do this separately for every sample.
That’s 3 different places you can set the reference genome, only 2 of which are ever used! The reference genome now also needs to be a set of fasta files which need to be in files ending with .fa (mine ended in .txt but the docs don’t say that, you just work it out when you get an error when you try to use them).
Whereas the transition from one stage of the pipeline to the next used to be handled via a recursive make, the new system now works by letting you specify a command to run after the current analysis has successfully completed. This is in some ways more flexible as it allows you to insert your own custom scripts into the workflow, but it’s also a real pain to work out what set of commands you’re going to need to run to get to the very end of the pipeline, especially as these are going to reference folder names which don’t yet exist, and whose names are dynamically generated from the data in the sample sheet. Add to this that BAM file creation is now a per-sample process and this means that configuring a full run is now a major undertaking, and one which is very fragile since the pipeline itself cannot validate the options for anything beyond the initial analysis step. Unless you have a completely standard analysis then this is going to put an end to running stuff over the weekend and being sure that it will have finished successfully when you come back on Monday!
Output files
The move to use standard formats for output files should have made life easier for users of the pipeline, but in practice it’s actually harder to work with the new system than the old one.
One change which has been introduced, probably due to the number of sequences generated by hi-seq machines, is that you now get a set of output files rather than a single file for all of the analyses. This was sometimes the case for internal formats in previous releases (eg qseq files), but not it applies to some of the final outputs. The sets of files have a common start and end with a grouping number which let you know how to recombine them. So, for a GAIIx lane you might expect to get up to 8 gzipped fastq files for each lane. You can pass an option to increase the number of sequences in each file , but if you make that limit over 16million then it breaks eland, so you’re really stuck with multiple output files. This same problem affects eland output as well with multiple gzipped export files.
Another problem with the fastq output is that unlike older versions of the pipeline the fastq files in Casava 1.8 are not filtered, so you’ll have an entry for every cluster on your flowcell. A flag in the header tells you if your sequence should be filtered, but none of the downstream programs will understand that, so you’re going to have to do it yourself, which takes away the point of having fastq files assembled for you already!
Illumina do acknowledge both of these problems in the documentation, and provide documented workrounds. Unfortunately both of the workrounds they provide are broken – they say that you can simply cat the gzipped fastq files together to combine them, but unfortunately this puts gzip headers in the middle of your file. Some decompressors (gzip itself for example) can understand this and will extract all of the data, but others (the standard java gzip extractor for example) do not, and just stop extracting data (with no warnings or errors) after the end of the first compressed section. There is also a somewhat complex grep statement provided which should filter out flagged sequences to produce a fastq file, but this was reported to create broken fastq files from some inputs, and leaves you with uncompressed data.
We ended up writing our own short Perl script which filtered, combined and re-compressed the fastq files but this (well the compressing part) is a slow process which we’re currently performing manually.
The other standard output file used in the new version is BAM files for aligned output. What wasn’t said though is that this is not the default output from eland (which is still export files, albeit now gzipped). Creating BAM output requires that you run part of the variant calling pipeline to run the sort and bam targets to turn your export files into a bam. There is a stand alone script to to this without going through variant calling, but this only produces SAM files. There is also a –bam option for Eland, but you can only set this (according to the docs at any rate) if you run Eland as a stand alone program, and not by running the actual pipeline. The documentation for this part of the pipeline is tortuous with at least 3 different way to create BAM files all with their own options and requirements.
When you do finally create a BAM you find that in order to do this you must create an arbitrarily named directory (so no chance of finding it with our LIMS), which contains a completely generically named bam file (sorted.bam), so really no way to associate this robustly with a sample, along with separate bam files for every chromosome in your reference genome, even if you’re not doing any other analysis. All to get the mapped data in an output you can use with any other downstream application.
Finally, the base calls, alignments and variant calls are now not stored in a structured heirarchy under the data folder, but get their own top level folders inside the run folder (base calls go in ‘Unaligned’, export files go in ‘Aligned’ and BAM files go wherever you tell them, with no defaults). What this means in effect is that you can only run one set of base calls and alignments if you need to be able to reliably predict where your output will go. Subsequent runs will fail if the output folder already exists. I realise many people won’t ever both reprocessing their data, but we do this all the time and this change is going to limit what we can do. We can work around it locally, but we also distribute a LIMS which has to find these files on other sites and this gets to be impractical in the new system (and impossible in the case of variant calling output).
The bioinformatician strikes back…
So what is the outcome of all of this?
In our case the outcome is that we’re going to:
- Use a standard default sample sheet for all runs so we at least get consistent directory naming
- Rewrite our LIMS to be able to recognise the new folder structure generated from our standard sample sheet
- Write a script to filter and combine the multiple fastq files into a single file we can give back to our users
- Stop using Eland for our mapping. It’s just too much of a pain to put this into the pipeline and link the results with our LIMS. We were using bowtie and tophat for secondary mappings anyway but these will take over as our primary mapping pipeline.
- Work out a nasty hack to allow our LIMS to recognise reprocessed folders – which won’t work on any site other than ours.
All in all, for our use case this update has made our pipeline significantly less functional than it used to be. I can see that there are probably people who would benefit from some of the changes here, but they are far from being the promised panacea. I’d be interested to know how many groups are actually doing any downstream analysis in Casava, as opposed to stopping after base calling and using other options. I got used to getting funny looks when I said we were using Eland, and I suspect these changes will make the users of Casava for analysis an ever rarer breed.
With luck some or all of the problems we’ve experienced will be rectified in future releases. But I think once we’ve moved away from doing anything beyond base calling in Casava then I don’t see us moving back.
Addenda…
Since posting this originally I’ve received some feedback which might be useful.
It appears that you can run Casava without a sample sheet at all, even though the documentation makes no mention of this, and actually says it will read one from a default location if one isn’t specified. The output directories are named after the flowcell id for the the project name, and the lane number for the samples – which is way better than using a minimal default sample sheet.
In a more recent update Illumina say they are intending to release an update to Casava (1.8.1) which will change the behaviour of the BCL converter to not include reads which failed QC by default. They will also provide an optional flag to write all of the sequence output into a single file. This should solve our problems with this part of the pipeline.
Tags: casava, illumina, sequencing
Posted in
Bioinformatics, Computing |
No Comments »
Importing RNA-Seq data into SeqMonk
September 4th, 2011
Introduction
Mapped RNA-Seq data coming from eukaryotes is probably the most complicated data type to import into SeqMonk due to it’s relative complexity and the abundance of options with which you are presented. Depending on exactly what sort of information you want to know about your data different data import options will be useful, so understanding the best way to import your data can be important.
Options
Virtually all RNA-Seq data will start out as a BAM / SAM file since this is the only format currently capable of describing mapped positions which span one or more splice junctions. The options for import are therefore the options found in the BAM / SAM parser. The options you have are:
- Is the data single or paired end
- Do you want to split the reads into spliced segments
- Do you want to import introns instead of exons
Each combination of options will produce a different set of aligned reads and will be useful for different types of analysis. The figure below summarises the reads which are imported from a set of spliced paired end mapped data. The colours represent alignment direction (red=forward, blue=reverse).
Choosing the correct options
The options you choose will depend on what you want to do with your data. Typically you either want to do one of three things:
- Quantitatively analyse the expression of each transcript
- Quantitatively analyse splice variation
- Identify novel transcripts
If you want to quantitatively analyse expression then you need to split your reads so that separate reads are recorded for each mapped splice section. You should then combine this with the base pair quantiation so that each spliced read is counted proportionally in the different exons it spans. If you have paired end data then selecting paired end import will simply reverse the orientation of any reads coming from the second read in a pair. If your data is not directional then this won’t make any difference, but if you have a directional library this will ensure that the correct orientation is assigned to each read.
If your interest is in the analysis of splice variants then the easiest way to do this is to import only the introns from your mapped data. Introns will only be imported from a read which contains an internal splice site, reads which do not splice will be ignored. Once the introns have been imported these should be quantitated using the read position probe generator, to put a probe over every different splice junction combination and then quantitated with the exact overlap quantitation to get an exact count for each splice combination seen in the data. If you have paired end data then it’s best to import introns as paired end data so that splice junctions from the second read match exactly (position and strand) with those from the first read.
Finally, if your interest is in seeing the extend of novel transcripts then you can import spliced mapped data as a normal BAM / SAM file. Since the mapped region is not expected to be contiguous on the genome you will need to greatly increase the filter for largest allowed distance between the ends of the reads to avoid rejecting reads which span introns. This type of import will join the ends of the read (or reads for paired end data) so you get the most complete view of the region of the genome spanned by your reads, but the quantitative influence of each read will depend on the size of the introns spanned, so importing data this way should only be used as a qualitative way of viewing your data. Combining this view of your transcriptome with a more conventional spliced import in a different track should give you the most complete view of the position and extent of each of your transcripts.
Tags: import, rnaseq, SeqMonk
Posted in
Bioinformatics |
No Comments »
Getting the java heap size you asked for
August 26th, 2011
In a recent post I discussed a method we’re using for automatically setting the java heap size appropriately at runtime. It now turns out that the issue of setting the heap size is complicated by the fact that the heap size you request on the command line isn’t necessarily what you get given. In some cases the differences are modest, but sometimes they can be significant – amounting to hundreds of megabytes of discrepancy.
The simple test I did was to compare the heap size requested by setting the -Xmx value on the java command with the actual amount of available memory as reported by Runtime.getRuntime().maxMemory(). What I found was that the relationship between these two values isn’t 1:1, isn’t fixed at a given ratio, and is platform (and indeed VM) dependent.
According to this bug report the actual implementation of -Xmx is VM-dependent, so that the value you supply on the command line is merely a suggestion to the VM and it’s free to do whatever it likes. Because I’d like my software to work consistently on all platforms I therefore had a look at what the different VMs actually do.
The OSX VM actually stays very close to the requested amount of memory across the whole range of requested heap sizes. The linux and windows VMs though overcommit at small heap sizes (there seems to be a minimum allowed heap size of ~10MB), but undercommit by up to 12% at larger heap sizes. When you’re requesting a heap of serveral gigabytes in size a 12% loss is a significant amount of memory.
Our immediate solution to this problem is to do a trial run where we launch a small program which reports the actual heap size allocated. We then relaunch the normal java command, increasing the heap request size by a correction factor calculated from the trail run. This seems to produce consistent results on all platforms and gives us what we asked for in the first place.
Tags: hack, java, memory
Posted in
Computing |
No Comments »
Mac application bundle caching
August 10th, 2011
Having spent a frustrating hour or so trying to update a mac application bundle I thought I’d share a couple of things which caused no end of confusion and aren’t what you’d expect and are therefore likely to catch out those working with application bundles for the first time.
Basically I was finding that although I was making changes to my bundle these changes weren’t actually being applied. The bundle seemed to hold on to older versions of the data in the bundle even if this had been changed or even deleted.
It turns out that OSX caches information from the application bundle and then doesn’t check to see if the information in the bundle has been updated unless it is forced to. There are two separate caches which can affect you, a cache of the bundle icons, and a cache of the Info.plist file.
Icon cache
Inside your application bundle you can create a file containing a series of icons for your application. You make this file from a series of individual bitmap files normally by using the icon composer application which comes with the developer tools in OSX. What you will find is that if you update your icns file containing your application icons you won’t actually see any difference in the icons which are shown in the finder. The reason for this is that OSX caches the icons associated with files in a folder in a hidden file called .DS_Store. If this file is present then the finder will read the icon data out of that even if the icns file in your application bundle has been updated.
If you want to force the finder to recalculate the DS_Store file then you need to do the following:
- Make sure you have closed all finder windows showing your folder of interest
- From within the terminal delete the .DS_Store file from your folder of interest
- Relaunch the finder using Apple > Force Quit > Finder > Relaunch (this won’t affect any of your running applications)
This will force the finder to recalculate the icon cache for your folder and any changes to your icns file will be visible.
Info.plist cache
The Info.plist file in your application bundle provides all of the basic information about how to launch your application, where to find the icon file and various settings relating to the applications name and version. As with the application icons the information in the Info.plist file is cached by the OS, such that if you change the data in the file (say you change the name of the executable you want to run), then the bundle will remember the old value not the new one.
As an aside, if you’re having problems running an application bundle, the easiest way to see what’s actually going wrong is to open the Applications > Utilities > Console application and then relaunch your application. Any output generated on stdout or stderr by an application bundle will show up in this application and you can see any error messages which are generated.
Clearing the Info.plist cache is a bit easier than clearing the icon cache. The quickest way to reset it is to simply move your application to a different folder (say your desktop), and then move it back. This will reset the Info.plist cache and your changes should be applied.
Tags: OSX, software
Posted in
Computing |
No Comments »
Dynamically setting the java heap size at runtime
July 29th, 2011
One of the oddities about java programs is that they require you to set a maximum heap size when you start the program. What this means in effect is that you need to be able to predict the memory usage of your program before it starts, and whatever heap size you set needs to be appropriate for all of the system the program is going to run on and all of the datasets it will handle.
When you’re distributing a desktop application which needs to process tens of gigabytes of data this can be a problem. Ideally you’d like to set a heap size of a few gigabytes to give yourself enough overhead to process even the largest of datasets. However not all machines have that much RAM installed, and even if they do they require a 64 bit OS to be able to use more than 2GB of it on a single process.
Up until now we’ve resorted to setting a lowest common denominator heap size (1500M), and providing instructions for increasing this on systems which can handle it. This is however very inelegant and means we have to warn users if they’re running out of memory and make them save, reconfigure and restart the application.
We have now moved over to using a system which dynamically sets the heap size to an appropriate value at runtime. We do this by writing a Perl wrapper which launches the java application after having worked out the most appropriate heap size for the current system.
To do this we have to work out:
- Whether we have a 32 or 64 bit JRE to work with
- The amount of physical RAM in the machine
The heap size is set as 2/3 of the physical RAM which leaves enough overhead for the JRE and basic system processes. In addition we set a ceiling on the heap size. For 32 bit systems this is 1500M which is the most you can practically use given the 2GB per process limit (you have to leave something for the JRE itself). For 64 bit systems we set the ceiling at 6GB. It’s not in our interest to set the heap size too high as this ends up resulting in long freezes during garbage collection, so we set it to the largest size we’re practically going to need. We work out if we’re on a 64-bit system by parsing the output of java -version (it doesn’t matter if the OS is 64 bit if the JRE is still 32 bit).
Finding the amount of phyiscal RAM is a platform specific task. On windows we have to use the Win32 API. Under linux we parse the output of ‘free’, and on OSX and the BSDs we parse the output of top.
If the user doesn’t like our auto-configured heap size we allow them to override it by passing a -m argument to the wrapper script.
For unix-like OSs we don’t need to worry about perl being present, but for windows we compile the script into a windows binary using pp. The Win32::API module is bundled into this binary. On other platforms we don’t need to distribute this since it’s loaded dynamically at runtime only if perl’s $^O variable tells us we’re running under windows. Under OSX we can run the wrapper nicely from the command line. We’re still working out the best way to include this as part of an application bundle.
This isn’t perhaps the cleanest of solutions, but compared to the very manual process we had before it’s a lot easier for the users to get their systems set up optimally.
#!/usr/bin/perl use warnings; use strict; use English; use FindBin qw($RealBin); use Getopt::Long; use IPC::Open3; my @java_args;
# See if they manually set a heap size
my $memory;
my $result = GetOptions(
'memory=i' => \$memory,
);
if ($memory) {
if ($memory < 500) {
die "Memory allocation must be at least 500M";
}
}
else {
$memory = determine_optimal_memory();
}
unshift @java_args,"-Xmx${memory}m";
exec "java",@java_args, "uk.ac.bbsrc.babraham.SeqMonk.SeqMonkApplication";
sub print_error {
# We wrap errors like this so we can keep a windows shell window open
# so the user can see any errors we generate
my ($error) = @_;
warn $error;
$_ = <STDIN>;
exit 1;
}
sub determine_optimal_memory {
# We'll set a ceiling for the memory allocation. On a 32-bit OS this is going
# to be 1500m (the max it can safely handle), on a 64-bit OS we won't take more
# than 6GB
my $max_memory = 1500;
# We need not only a 64 bit OS but 64 bit java as well. It's easiest to just test
# java since the OS support must be there if you have a 64 bit JRE.
my ($in,$out);
open3(\*IN,\*OUT,\*OUT,"java -version") or print_error("Can't find java");
close IN;
while (<OUT>) {
if (/64-Bit/) {
$max_memory = 6000;
}
}
close OUT;
warn "Memory ceiling is $max_memory\n";
# The way we determine the amount of physical memory is OS dependent.
my $os = $^O;
my $physical;
if ($os =~ /Win/) {
$physical = get_windows_memory($max_memory);
}
elsif ($os =~/darwin/ or $os =~ /bsd/i) {
$physical = get_osx_memory($max_memory);
}
else {
$physical = get_linux_memory($max_memory);
}
warn "Raw physical memory is $physical\n";
# We then set the memory to be the minimum of 2/3 of the physical
# memory or the ceiling, whichever is lower.
$physical = int(($physical/3)*2);
if ($max_memory < $physical) {
return $max_memory;
}
warn "Using $physical MB of RAM to launch seqmonk\n";
return $physical;
}
sub get_linux_memory {
# We get the amount of physical memory on linux by parsing the output of free
open (MEM,"free -m |") or print_error("Can't launch free on linux: $!");
while (<MEM>) {
if (/^Mem:\s+(\d+)/) {
return $1;
}
}
close MEM;
print_error("Couldn't parse physical memory from the output of free");
}
sub get_osx_memory {
# We get the amount of physical memory on OSX by parsing the output of top
open (MEM,"top -l 1 -n 0 |") or print_error("Can't get amount of memory on OSX: $!");
my $total_mem = 0;
while (<MEM>) {
if (/^PhysMem:.*?(\d+)M\s+used,\s+(\d+)M\s+free/) {
$total_mem += $1;
$total_mem += $2;
}
}
close MEM;
unless ($total_mem) {
print_error("Could't parse physical memory from the output of top");
}
return $total_mem;
}
sub get_windows_memory {
warn "Getting windows physical memory\n";
# This code was adapted from an answer posted by Tom Feiner on
# stackoverflow
#
# http://stackoverflow.com/questions/423797/how-can-i-find-the-exact-amount-of-physical-memory-on-windows-x86-32bit-using-per
my ($max_memory) = @_;
eval {
require Win32::OLE;
Win32::OLE->import qw( EVENTS HRESULT in );
1;
} or do {
print_error("Couldn't load Win32 module to determine windows memory");
};
my $WMI = Win32::OLE->GetObject( "winmgmts:{impersonationLevel=impersonate,(security)}//./" ) || print_error ("Could not get Win32 object: $OS_ERROR");
my $total_capacity = 0;
foreach my $object (in($WMI->InstancesOf( 'Win32_PhysicalMemory' ))) {
$total_capacity += $object->{Capacity};
}
my $total_capacity_in_mb = $total_capacity / (1024*1024);
return $total_capacity_in_mb;
}
Tags: java, software
Posted in
Computing |
1 Comment »
Want to improve your science? Get a dog.
June 19th, 2011
Actually the dog is somewhat irrelevant – it’s what comes with it which matters. One of the side-effects of dog ownership is that you get to spend an hour or so a day out walking, which means you have an hour or so with your own thoughts and no distractions.
I’m sure everyone has experienced the situation where you go to sleep thinking about a problem, and then wake up in the morning with the answer. It’s pretty common for your brain to replay the events of the day whilst you sleep and it’s surprising how just making time to think about something can can help you to clarify and understand problems you’re working on. Doing this overnight is good as far as it goes, but can end up somewhat surreal as you effectively relinquish control of your train of thought.
The same principle applies during the day though. If your brain isn’t immediately occupied with a specific task it will tend to go back over other problems you’re working on at the moment. The problem is that in a modern office or lab setting we hardly ever have time when we’re not being presented with something to do. Even when we’re not at work it feels really unnatural to not be doing anything. There’s always something to read, watch or do so your brain never gets chance to drift back to the topics which could usefully employ it. That’s where the dog comes in.
I actually first noticed this effect when I started cycling to work. My commute by bike takes me about an hour (which is why I don’t do it all that often!), but I found that when I cycled to work I would be hugely productive for the first couple of hours of the day and find really creative solutions for things I was working on. I first put this down to having more energy from my early morning exercise, but I then found that if I put on a podcast whilst I cycled I was no more productive than if I’d driven – so it wasn’t the cycling, it was having an hour where my brain had nothing to do other than chew over the problems I was working on.
Dogs are even better in this respect in that you don’t really get a choice about whether you’re going to take them out. You’re forced to spend some time every day with no external distractions from which you can gain the attendant benefits.
Somewhat perversely I find that conferences work this way too, only they provide a double benefit. Anyone who’s ever been to a conference will know that talks fall into two categories – those which capture your attention and provide new and interesting ways of interpreting your science, and those in which you switch off and stop listening in the first five minutes. Actually I’d argue that both of these types of talk are beneficial. The first inspires you and gives you new information to process, and the second gives you some uniterrupted time to think about what you learned in the first. Many of the most intersting projects I’ve worked on have begun during poor talks in a conference, where I’ve stopped listening to the current talk and have sketched out the structure for a software package, or a theory which I could test. Granted it doesn’t always work this way. I once spent an hour in a particularly dull keynote writing a program which would rate all of the keynotes in a conference by the change in ping times on the wireless network (it’s conclusions matched surprisingly well with my own personal judegment), but if not productive at least that was creative.
My tip for improved science then is not necessarily to get a dog – although I can heartily recommend doing so – it’s to force yourself to make some time each day where you don’t have anything to think about. Your brain will thank you for it.
Posted in Bioinformatics, Computing | No Comments »
Choosing the best format for raw sequence data
June 16th, 2011
Introduction
In the current Illumina pipeline raw sequence data is generated in qseq files, but can optionally be converted to the more standard FastQ format for use with other analysis programs. The FastQ files produced are uncompressed text files and take up a considerable amount of space in our storage system. We’ve therefore been thinking about either compressing or converting these files to save on the amount of storage they require.
At the same time we’ve also been expanding the range of compression schemes supported in FastQC which gives us a good impression of how quickly we can extract data from the different available formats and since I’ve collected some data on the storage and processing requirements for the different formats available I thought I’d share these to help inform others who may be making similar decisions.
The choices we have are to simply compress the fastq files with a standard compression scheme. The two most commonly used are gzip and bzip2. Many analysis programs support gzipped fastq files as input in addition to uncompressed files, and a few are starting to support bzip2. Bzip2 is often chosen over gzip because it compresses data more efficiently. The other choice is to put the raw data into a BAM file. This format is specifically designed to hold high throughput sequence data and uses a compression scheme which is designed to be optimal for sequence data. The BAM format was primarily designed to hold information about sequences which had been mapped to a reference sequence, but it also allows for raw sequences with no associated mapping to be stored, but with some overhead for the mapped position fields which are not used.
For the tests I used a fastq file containing around 500,000 reads of 33bp in length. The processing times were taken as the time to process the file completely with FastQC. Since the processing overhead for the QC analysis should be the same in all cases any differences will be attributable to the different amount of data needing to be read from disk, and the CPU time required for the decompression. The tests were run on a MacBook Pro laptop (so not the fastest hard drive, or the speediest CPU). FastQC uses pure java decompression code. For gzip compression this is built into the JRE and for bzip2 I used the jbzip2 library.
Results
| File type | File size | Time to process (seconds) |
|---|---|---|
| Uncompressed FastQ | 69.8MB | 14.1 |
| Gzip Compressed FastQ | 17.5MB | 11.0 |
| Bzip2 Compressed FastQ | 13.9MB | 72.1 |
| BAM | 16.3MB | 11.5 |
Conclusions
It is clear that converting your raw FastQ files to a more efficient storage format will produce significant gains in disk space usage. Reducing your storage requirements by a factor of 4-5X and actually making your processing more efficient in some cases is a win-win proposal.
From the results presented it seems clear that unless disk usage is critically important then bzip2 compression is not a viable solution. Increasing your processing time by over 500% for a 20% reduction in size does not seem to be a good trade off. I’ve also checked using the command line gzip/bzip2 decompression utilities in case the effect I saw was an artefact of the java implementations, but the size of the difference between the two was similar there as well.
Choosing between gzip and BAM is less clear. It’s probably fair to say that at the moment more analysis programs support gzipped fastq files as input than support BAM files (which is more normally seen as an output format), but this may change in the future. BAM files offer the prospect of adding in mapping data alongside your sequence data with a minimised increase in filesize which may be a benefit to some. If you’d prefer to keep your raw data separate from derived data then gzipped FastQ files would seem to be the better choice.
In our case we’re going to opt for simply gzipping our FastQ files since this seems to be a simple process which won’t affect any of our existing workflows or processing and which will return to us a significant amount of storage space.
Tags: compression, fastqc, java
Posted in
Bioinformatics, Computing |
No Comments »
Adding custom chromosome name mappings into SeqMonk
June 11th, 2011
When loading data into SeqMonk the program has to try to connect the chromosome names used in your data file with those which are present in the genome one which your project is based. In many cases there won’t be an exact match between the two – many mapping programs report file names in their output so you often see names like ‘chr1.fa’ used to represent chromosome 1.
In order to match as many entries as possible SeqMonk tries a few different tricks to match up the chromosome names if it doesn’t immediately find an exact match. It will do simple manipulations such as removing ‘chr’ from the front of the name and removing things like ‘.fa’, ‘.txt’ and ‘.gz’ from the end. However in some cases it isn’t able to match up the names, and in these cases the read is rejected and you’ll get an import warning saying:
‘Couldn’t extract valid name for [some string which wasn't a chromosome name]‘
The most common offender for this message is the mitochondrion which is variously labelled as M, Mt, MT or some other minor variation. The other one we commonly see are the yeast genomes which seem to alternate between using roman and arabic numbers to describe their chromosomes.
If you get this error and need to import the data which was rejected then how do you bring this data into SeqMonk? The most obvious way is to go back to your data file and rename your chromsomes to match those in SeqMonk. Whilst this is simple to describe it’s often practically quite hard since data files are usually very large, and if they are in a non-text format (eg BAM), then modifying them isn’t at all straight forward.
A better alternative then is to tell SeqMonk in advance about any chromosome name mappings which it isn’t able to figure out for itself. There is a pretty simple way to do this, but it isn’t very obvious in the documentation and most people who hit the problem don’t seem to find it so I thought I’d raise its profile.
The way to add custom chromosome name mappings in SeqMonk is to create a text file called ‘aliases.txt’ in the folder which contains the data for the genome assembly you’re using. If you were using the mouse NCBIM37 genome for your project the file would need to go inside the [Genomes dir]/Mus musculus/NCBIM37 folder. This file is a 2-column tab delimited text file where the first column is the name you want to be able to use in your data files and the second column is the name of a chromosome in your SeqMonk genome. Both of these names must be an exact match since the aliases are not subject to the same manipulations as normal chromosome names read from a data file. There is no problem having multiple aliases pointing to the same SeqMonk chromosome, but you can’t have the same aliases pointing to more than one SeqMonk chromosome (if you do, then the last one wins).
So, and example aliases file would look like:
chro1 1 II 2 NC1234 3 c3 3 bob X
Once this file is in place and SeqMonk has been restarted (or at least the genome has been reloaded) you should be able to import data files containing any of the names in the first column into your project without seeing any warnings. This file is only required when the data is first imported, after which all the chromosome names are turned into SeqMonk names, so you don’t need to distribute the aliases file to be able to use SeqMonk projects which required that file to have their data initially imported.
Tags: SeqMonk, software
Posted in
Bioinformatics |
No Comments »
Interpreting the duplicate sequence plot in FastQC
May 23rd, 2011
Background
The one analysis module which seems to elicit more questions than any other is the duplicate sequence plot. Of all of the plots which the program generates it’s probably the one which causes the most warnings / errors in otherwise nice looking data. I’m happy to admit that it’s not always immediately obvious what the plot is telling you, but it does contain some really useful information, so it’s worth getting to grips with it!
The Basics
So what does the duplicate sequence plot tell you? As with all of the plots in FastQC it’s not there to tell you if your data is good or bad, it’s there to tell you if your data looks unusual in some way. More specifically the point of the duplicate sequences plot is to tell you to what extent you are wasting the sequencing capacity you have paid for by simply resequencing the exact same sequences over and over again.
The point of the plot is that in a diverse random library, even at relatively high coverage you shouldn’t resequence the exact same region multiple times. If across your entire library you are sequencing each fragment multiple times then in most cases you stop learning anything new about your library. You could effectively get the same amount of information by doing less sequencing, which either means you would benefit from a more diverse library, or you could do less sequencing (eg mixing the library with others in a multiplexed experiment).
One of the slightly confusing aspects of the plot is the scale used for the y-axis, but this scale is a result of compromises made to allow FastQC to run in a reasonable amount of memory. In an ideal world we would analyse every sequence in a library and count how many times each one occurred. If we did that we could put absolute numbers on the plot and this would make it easier to interpret. However, if we had a perfectly diverse library with no duplicated sequences this would mean holding the whole library in memory at once which isn’t practical on many machines. What happens instead is that the analysis occurs only for the first 200,000 different sequences seen. The number of occurrences of these sequences is then tracked through the rest of the file, but any new sequences after the first 200,000 are then discarded. Because this precludes the possibility of generating absolute numbers we plot the results on a relative scale, with the number of sequences occurring exactly once being set as 100%, and everything else being shown relative to that number. The plot value for duplicate 1 is therefore always 100%, and higher duplication levels can produce values either higher or lower than this.
In an ideal plot for a diverse library the values for duplicate levels above 1 should quickly decay to zero and stay there. However there are several ways in which the plot can indicate other types of problem in the library.
How things can go wrong
There are a number of different potential modes of failure which show up in the duplicate sequences plot. Here we try to provide examples of some of the most common ones and explain how they might occur.
A small number of sequences dominate an otherwise diverse library
In some cases you may find that a very small number of sequences end up comprising a large proportion of your library. The most common example of this is contamination by adapter sequence but any small source of contamination could conceivably cause a similar effect.
In this case the shape of the duplicate plot would look normal, with the unique sequences being the highest value and the rest of the plot quickly decaying to values close to zero, however the overall duplicate percentage shown at the top of the plot will be very high. The reason the plot looks normal is that it is limited to a duplicate level of 10+ so a very small number of highly duplicated sequences won’t cause the plot to peak.
The easiest way to confirm this type of problem is to look at the overrepresented sequences module results which should list the overrepresented sequences along with the proportion of the library they comprise.
Every sequence in a library occurs a large number of times
If a library has very limited diversity is then subjected to high throughput sequencing then it’s possible that every sequence seen in the library is likely to be seen many times.
In this case the plot trace will be low through all duplication levels until the 10+ bin, where the plot will spike sharply, possibly showing a relative level of many hundreds of percent relative to the unique sequences. The overall duplicate level will probably also be high (>90%).
The root cause of this type of result is low diversity in the original library. This could be because you are hugely over-sequencing a diverse library – a PhiX control lane for example would show a result like this, or it could be that the library diversity is unexpectedly low. The most common cause of unexpected loss of diversity is PCR over-amplification where a small subset of the full library is artificially expanded through the use of too many cycles of amplification. Such libraries can also be found from experiment types which are expected to generate limited diversity, such as those which are based around restriction sites rather than random fragmentation.
A low level of duplication is seen across a whole library
In some cases you will see a less extreme plot where the rate at which the duplicate plot falls from unique sequences is slow – showing appreciable proportions of the library with duplication levels of 4-5, and it may show a small spike in the 10+ bin.
In some cases this will simply be a less extreme version of the last situation where every sequence in the library is subjected to low level duplication, and natural variation spreads this across a somewhat broad range.
In other cases though you might be dealing with a library which has highly variable levels of duplication – and this may have a biological rather than a technical cause. The most common type of library to produce this type of plot is an RNA-Seq library. In this type of library it is expected that some sequences will occur very frequently, and others will be very rare. If you want to see the very rare sequences (eg low copy number transcripts), then you will have to greatly over-sequence the most frequent sequences (eg housekeeping genes), so a high level of duplication in part of the library is unavoidable.
A final warning
One final thing to bear in mind when looking at duplicate sequence plots is that they are based on an exact match between sequences. Any sequencing errors in the library will tend to create artificial diversity in the library – making identical sequences look different because of technical errors. To mitigate against this effect in long reads only the first 50bp of each sequence are used to assess the duplication level – however if the sequence quality in a library is very poor then the duplication plot for a heavily duplicated library might be made to look perfectly normal due to the introduced errors. You should therefore always consider the results of the sequence quality plots (especially the per-base quality plot) alongside the duplication plot to gain a realistic assessment of the true level of duplication.
Tags: duplication, fastqc, interpretation, quality
Posted in
Bioinformatics |
No Comments »

