Archive for 2009

Overestimating DWIM in Perl

Saturday, October 17th, 2009

I hit a bug in a script I was writing this week which reminded me that sometimes you can put too much faith in perl’s ability to ‘do what I mean’ (DWIM). It took me a couple of minutes to see what was going wrong here – see if you do any better.

#!/usr/bin/perl
use warnings;
use strict;

my $data;
populate_data($data);
print_data($data);

sub populate_data {
  my ($d) = @_;

  $d->{somekey} = 'somevalue';
}

sub print_data {

  my ($d) = @_;

  if (exists $d->{somekey}) {
    print 'Win!';
  }
  else {
    die 'Fail!';
  }
}

No errors, no warnings, but epic fail.

One of the things which confuses people learing perl is the creation of complex data structures and autovivication.  That is to say if you treat a scalar as a reference to an array of array of array of arrays (for example), then that’s exactly what it becomes – you don’t need to explicitly define the structure as it is all created on the fly.

In this case though I was expecting too much.  When the $data variable is first created it has an undefined value.  When I pass it into the subroutine all that is passed is the undefined value.  Adding data to the undefined value autovivifies the expected data structure, but since this was created in the subroutine there’s no way for that to propogate back to the calling code.

The answer is to define at least the top level of structure before calling the populate_data subroutine.  That way you are passing in a valid reference to the subroutine (where further structure can be added), but the data is added to an anonymous data structure which is still referred to in the calling code.

The simple fix is therefore to make the initial declaration of $data be:

my $data = {};

..and Fail becomes Win.

Tags: , , ,
Posted in Computing | Comments Off


Sound Level Monitoring in Java

Tuesday, October 6th, 2009

For a project I’ve been working on I needed what I thought should be fairly easy to make – a simple widget to monitor an input sound level.  I’d never worked with the full javax.sound API before, but assumed that there would be ample documentation to do what I needed.

Having started on the project things turned out to be a bit more tricky than I thought.  Most of the examples I found revolve around passing sound from an input to an output, and the sound API makes this fairly easy.  What proved to me more tricky was the intercepting and processing of a sound input.  I’ll therefore go through what I did to make this work in the end.

Let’s start with the basics:

What you need at the end of the day is a TargetDataLine object.  The nomenclature of the javax.sound API is somewhat unusual in that a SourceDataLine is a sound output line (such as a set of speakers), whereas a TargetDataLine is a sound input line (such as a microphone).

You have two choices for getting a TargetDataLine, you can either go through a rigorous process of exploring the capabilities of all of the sound lines on your system and find some way to select the best match to what you want.  Alternatively you can decide on the type of line you want and request that from the sound system.  If this type of line is not available then you will get a LineUnavailableException.

Having tried both approaches I ended up plumping for the second option.  Since I didn’t need high precision and wasn’t asking anything difficult of my sound source I could select conservative properties for my line and trust that nearly all sound cards would be able to support this.  In practice this code has worked on every machine I’ve tried it on, but conceivably it could fail on some sound cards.

You need two things to get a TargetDataLine.  An AudioSystem and an AudioFormat object.  The AudioSystem class provides a series of methods through which you can access the components of the audio system on your machine.

An AudioFormat object defines the characteristics of the sound stream you want to obtain.  There are a few different parameters you must set and there are a range of acceptable values which will be supported by the majority of sound cards:

  • Sample rate: This says how many times per second the sound will be sampled
  • Sample size: The number of bits in each sample taken
  • Number of channels: Whether this input is mono or stereo
  • Signed: Whether the sound samples are signed or unsigned
  • Endianness: For multi-byte sample sizes says whether the byte order is big or little endian

Since I was only interested in monitoring a line level I chose fairly conservative values:

  • Sample Rate: 8000.  This is about the lowest commonly used sample rate.  CD quality sound is 44.1kHz, but this is overkill for a simple sound monitor
  • Sample Size: 16 (bits).  You can also use 8 bit samples depending on the resolution you are after.
  • Channels: 1 My interest was in overall sound level so mono sound was OK
  • Signed: true.  This is the more common option, and is easier to deal with in java since all java primitives are signed.
  • Endianness: Since my samples are 16 bit then byte order matters.  Most common audio formats (eg WAV) are little endian as is the x86 architecture, so this is the common choice.

In practice this means that to get a TargetDataLine I can do something like this:

 AudioFormat audioFormat = getAudioFormat();
 targetDataLine = (TargetDataLine) AudioSystem.getTargetDataLine(audioFormat);
 private AudioFormat getAudioFormat(){
   float sampleRate = 8000.0F;
   int sampleSizeInBits = 16;
   int channels = 1;
   boolean signed = true;
   boolean bigEndian = false;
   return new AudioFormat(sampleRate,sampleSizeInBits,channels,signed,bigEndian);
 }

Once I have my TargetDataLine I then need to activate it before I can read any data from it.  Activating the line is as simple as:

 targetDataLine.open();
 targetDataLine.start();

Once the line is open you can begin to read data from it.  The usual way to do this is in a separate thread where you have a buffer which you fill with data from the line and then process.  A read from a line will block until the buffer is full, so the size of your buffer will determine how often you process the data.

The amount of data produced will be a function of your sample rate, sample size and number of channels.  If you have one channel and 16 bit samples at 8kHz then every second you will produce 8000 * 16 * 1 bits of data or 16000 bytes of data.  Therefore a 16000 byte buffer will be filled every second, and 8000 byte buffer will fill in half a second.

Reading the input can therefore be done is a loop as follows:

byte [] buffer = new byte[2000];
while (true) {
  int bytesRead = targetDataLine.read(buffer,0,buffer.length);
}

The remaining problem is then how to process the filled buffer to get the overall level.  There are a few options for this, you could work out the average sound level over the whole sample, or you could work out the peak level throughout the sample.  I chose the latter method, but the basic process would be the same in either case.

Since my samples were 16bit I had to take into account that each sample was composed of two bytes, and I needed to recombine these into a signed short before processing them.  This involves using a bit shifting operation to combine the two bytes, and taking into account the little endian byte order.

short max;
if (bytesRead >=0) {
 max = (short) (buffer[0] + (buffer[1] << 8));
 for (int p=2;p<bytesRead-1;p+=2) {
   short thisValue = (short) (buffer[p] + (buffer[p+1] << 8));
   if (thisValue>max) max=thisValue;
 }
 System.out.println("Max value is "+max);
}

For an 8 bit sample I could use the individual bytes directly.  For a big endian sample the p and p+1 positions in the generation of the short would have to be reversed.

Using this method I can now sample the input line at any rate I choose to get the raw data from which a peak meter could be produced.  The final thing to remember is that our perception of sound should always be viewed on a log scale.  If you are creating a sound meter you should therefore log transform the data before plotting to gain a more realistic view of the sound level.

Tags: , , ,
Posted in Computing | Comments Off


Mapping Bisulphite Converted Sequence

Thursday, August 27th, 2009

I’ve been thinking lately about the best way to construct a mapping pipline for large sequence datasets which have been bisulphite converted.

Bisulphite conversion is mostly used to detect DNA methylation, although other uses are also being found.  The basic principle is that treating DNA with sodium bisulphite modifies cytosine bases such that when they are copied they pair with adenine instead of guanine.  Effectively all Cs are converted to Ts.  The exception is if the Cytosinse is methyated in which case it is not modified by bisulphite.

By comparing a bisulphite converted sequence to a known reference you can therefore tell which cytosines were methylated.

The problem is that in order to search a converted sequence to a reference you need a specialised search program which understands the changes introduced by the conversion and is still able to match these to the reference in their converted form.

There are a few programs available which allow you to do this such as BSMap or mrsFast.  These provide custom aligners which can match fully or partially converted sequence to a reference, but they can still leave a number of problems.

The main problem is that many types of experiment rely on being able to count the number of times a particular region is observed as methylated and unmethylated and using the ratio of the two to infer the overall methylation status.  The problem with doing this is that if your sequence is methylated it will still contain Cs, and therefore has more information than the converted form.  In many cases this means that the unconverted form is uniquely mappable within the genome, but the converted form isn’t.  In these cases you will bias your results in favour of the methylated sequences.

Also, bisulphite treatment is often not complete.  Some unmethylated Cs will not be converted just by chance.  Being able to make an assesment of the convertion efficiency allows you to correct the results you observe.  In mammals this is simple since methyation should only be observed where a C forms part of a CG dinucleotide.  Any protected Cs which aren’t part of a CG therefore represent incomplete conversion.  In other species such as Arabidopsis though methylation can occur on any C so a different way to measure conversion efficiency is required.

Our proposal for mapping is therefore quite simple.  All mapping must be done on fully converted sequence.  If we generate a fully converted genome sequence (which requires separate conversions for forward and reverse strands), then we can completely convert our sequence tags and try to map them against the converted genome.  If we find a unique match then we can compare the actual observed tag sequence to the unconverted genome sequence to obtain the methylation status.

There are a few advantages to this approach.

Firstly it requires no specialised mapping software.  Any of the standard mapping programs will be able to do the search, albeit with reduced efficiency due to the reduced complexity of the sequence.

Mapping with fully converted sequence ensures that there is no bias towards methylated sequences.  The downside of this is that some potentially mappable sequences are excluded – but at least we’re on a level playing field.

Operations such as conversion efficiency calculation can be introduced during the methylation calling step, and it is fairly simple to introduce rules to say whether a methylated base should appear as part of a larger sequence pattern.

Posted in Bioinformatics | Comments Off


Managing Really Large Data Sets

Thursday, August 27th, 2009

For a while now I’ve been working with next generation sequencing datasets.  Each dataset consists of around 10 million mapped genome positions, and an experiment can consist of 10 or more datasets.

When analysing this data memory usage is a major issue.  Up until now our approach has been to try to store everything in memory as efficiently as possible but even using machines with 12+GB of memory we’re hitting scaling limits.

To try to alleviate this I’ve been investigating using disk based storage rather than putting the data in RAM.  I’d avoided this before as I wanted the application to be as responsive as possible and didn’t want the overhead of the extra IO.

I’ve now taken the plunge and started experimenting with disk based storage, and my initial results are encouraging.

I chose to go with SQLite using the SQLiteJDBC connector.  My inital attempt was pretty basic using only a single table as a sort of disk based Hashtable and I hadn’t even got round to adding indices to speed up searches.  Despite this the speed of data access was so quick I couldn’t tell the difference from the in-memory storage, and the memory usage dropped significantly.

Since this worked out so successfully I’m not considering doing a major rewrite to put all of the application data into a single large SQLite database and just caching objects for the currently visible chromosome in memory (to allow quick redraws).  This is going to require some major refactoring, but it should allow the program to scale to much larger datasets and should make opening a project a virtually instantaneous operation.

Posted in Bioinformatics, Computing | Comments Off


Optimising Java Memory Usage

Sunday, June 21st, 2009

In the application I’m working on I have to deal with large amounts of data, which means handling tens of millions of java objects.  One of the biggest problems we face is keeping the memory usage of the program under control.

The main object which accounts for the vast majority of the memory consumption contains only a few fields.  These are:

  • A reference to a different object
  • An int to specify a start
  • An int to specify and end
  • An int to act as a flag (which only has 3 values)

To try to improve the memory profile of this object I tried changing the flag value to use a byte instead of an int, and also storing a start and a length (as a short) instead of an end value (as an int).

Making these changes I was surprised to find that the memory usage of the application didn’t change at all.  I didn’t see how this could be, since I was using smaller data structures so I went looking for details of how java allocates memory.  I was surprised at the results!

Basically, most class level variables in java consume the same amount of memory (4 bytes), except for longs and doubles which consume 8 bytes.  This means that it doesn’t matter if your variable is a boolean or an int it still uses the same amount of memory.  If you put a series of variables into an array then there is an overhead for using the array in the first place, but the individual members of that array are packed efficiently such that an array of bytes would take less memory than an array of ints of the same length.

In our case I tried out a few different options for combining the 3 ints at the core of our object to try to save memory.  I took the standard 3 int version of the class as the basis for comparison.

  • As stated above moving to an int, a short and a byte used exactly the same amount of memory, and slowed the application down slightly as more calculation had to be done to extract the end value each time.
  • Using a single long and bitshifting to fit an int, short and byte into it did save memory, but is not as efficient as I’d hoped since there is still one unused byte in the long – and I’m only using 3 possible values out of the range which I could encode in the byte.
  • Using a byte array to store 7 bytes (not wasting the extra byte used in the long) was much less efficient than any of the other solutions.

We’ve therefore gone down the route of packing data into a single long, which is saving ~20% of the previous memory usage, at the expense of some overhead during the packing/unpacking.  The packing overhead is pretty minimal though, adding only 2 seconds on a test which took around a minute in the original code.

Doing the packing/unpacking is fairly straightforward using the bitshifting operators in java.  In the following excerpt packedPosition is a long, start is an int, length is a short and strand is a byte.

packedPosition = start;
packedPosition += (long)length<<32;
packedPosition += (long)strand<<48;

Getting the data back out is also fairly simple.  For example, extracting the length would be achieved by doing:

(short)(packedPosition>>32;

We’re now looking for other cases where adding this extra complexity is worth the memory savings we can achieve.


                

Posted in Computing | Comments Off


Exporting SVG from Java

Sunday, February 8th, 2009

Several programs I’ve written in Java have had an image export componet to them.  Up until now the export has always only been as a bitmap image.  This is very easy to do using a BufferedImage and an ImageWriter.  Given a Component (all AWT or Swing widgets) and a file you can create a PNG very simply:

  BufferedImage b = new BufferedImage(c.getWidth(),c.getHeight(),BufferedImage.TYPE_INT_RGB);
  Graphics g = b.getGraphics();
  c.paint(g);
  ImageIO.write((BufferedImage)(b),"PNG",file);

I’ve always thought though that it should be possible to create an SVG file from an arbitrary component, but never got round to trying it.  However I got snowed in last week and decided to give it a go, and it turned out to be somewhat easier than I thought.

There is a natural fit between the abstract methods of the AWT Graphics class and the primitive components in the SVG spec – they even use the same coordinate system.  Creating an SVG export class therefore simply requires implementing the SVG spec and translating the method calls to SVG code.

Although the basic premise is fairly straightforward – there were a few gotchas which had be scratching my head!

  1. The Graphics class has a create() method which returns a Graphics object.  Initially I was just returning the same object each time and this seemed to work.  With more complex nested objects though the coordinates were getting messed up.  I worked out that it was important that each Graphics instance keeps track of its own translate coordinates as these are managed separately in each instance.
  2. Translate coordinates are relative and not absolute.  When a new set of coordinates are passed via the translate() method these must be added to the current translation, and don’t replace it.
  3. Fonts are problematic.  Size is fine, but font names are not trivially converted between the java font name and something SVG can understand.  It may be possible to figure out some generic rules or use a translation table, but since my applications all use a default sans-serif font of varying sizes I’ve hardcoded this for now.
  4. Some methods, such as font metric creation are a bit of a pain to write – so I’ve cheated and created an unused BufferedImage Graphics object within my class of the appropriate size and pass any difficult method calls to that – I can then discard it at the end.  This is slightly wasteful of memory, but is a quick fix.
  5. I coded my initial implementation on a Mac and it was all working.  I then tried it on Windows and Linux and got empty SVG files.  I resorted to a question on StackOverflow to figure out what was going on, and it turned out that the double buffering mechanism was causing the problems.  If this was enabled then all I saw in my Graphics class was a single call to drawImage() with a complete bitmap in it.  This was passed directly from the offscreen buffer, but meant I couldn’t intercept the individual shape method calls.  I therefore now use the RenderManager to disable double buffering on the component before calling its paint method which ensures nothing comes between my class and the Graphics methods calls.
  6. Lots of code seems to assume that the object which gets passed to paint always implemets Graphics2D.  I put a conditional check into my code for this, but it might be a good idea to implement (with null methods) the extra Graphics2D methods if this was to be used genrically.

At the end of all of this I now have a class which implements a single static method of:

public static String convertToSVG (Component c);

Which has worked in every instance I’ve tried.  There are some methods I’ve not implemented – namely the paintImage methods and some of the drawPolygon methods.  The polygon methods should be easy enough.  The paintImage may be more problematic, but I don’t feel too bad leaving these out since there’s not much advantage to using SVG if you’re just going to stick a bitmap into it.

The SVG code will be in the 0.3 release of SeqMonk and I may release it as a stand alone package if anyone shows any interest in it.


                

Tags: , ,
Posted in Computing | Comments Off