Archive for the ‘Computing’ Category

Where do you analyse next gen sequence data?

Tuesday, July 13th, 2010

We had an interesting discussion at the Bioinfo-core workshop at ISMB2010.  The discussion centred around the best way to handle the logistics of making sequence data available to the users of a sequencing service.  The problem is that the data is so big that even if you have a large central store you run the risk of the users taking copies of the data and filling up all of your day to day storage as well.

There are a couple of options which present themselves:

  1. You just pass a copy of the data to the users and let them worry about where they put it.  This is of course the easiest path, but will end up with users putting both data and subsequent analysis on insecure local disks on their workstations – potentially leading to big data losses when things inevitably go wrong.
  2. You keep all of the data on the central store and then expose this over the network (probably in a read only fashion) to the users.  This eliminates the problem of duplicated storage, but could still be problematic if the users generate large derived datasets.  It will also place a heavy stress on your network infrastructure as the data will effectively be downloaded every time a user runs an analysis against it.
  3. You keep all of the data on the central store and provide a remote access system for analysing it.  In this scenario you eliminate both the data storage and transfer problems since the data is always contained within the central facility and only views of the derived data are shown to the user.  The big downside is that the core facility is then responsible for managing the computational power required by all of their users for all of their analyses, which could add significant cost and workload to the core facility.

Option 3 is the most logical and efficient, but would require convincing users of a service that they should be paying up front for the costs associated with all of the future storage and analysis of their data.  Given that enterprise storage is significantly more expensive than most users would expect then this could be a hard sell, but it’s probably something we should at least try.

Another aspect to consider in all of this is that as the volume of publicly available sequencing data increases there are going to be many more occasions on which users will want to re-analyse public data rather than generating the data themselves.  Managing the influx of this external data is also likely to be a significant challenge for core facilities.  These data sets are every bit as large as those created in house, and will take just as much computational resource to analyse.  It would make sense to be able to include these sorts of data sets in the same storage and analysis system you use for your in-house data, although you will have the option to delete the primary data should disk space become limiting.

The long term solution to many of these problems may come from cloud computing.  Having data stored in central repositories and then making these available to servers in the cloud would provide a scalable solution which can provide the sort of cost effective flexible options we’re all likely to need.  The problem is that at the moment the costs of storage in the cloud make this sort of system uneconomic.  Cloud storage has not typically been associated with the analysis of huge datasets running IO bound analyses, but there’s no intrinsic reason why this shouldn’t be possible.

A wish for the future would be that cloud service providers recognise this potential market and cater to it.  What this would require would be to make a read-only copy of some of the main data repositories (SRA, GEO, EMBL etc) permanently available to any cloud server (possibly at a small cost).  Since these data repositories could be shared between users this wouldn’t be as arduous as having each user maintain their own data and would then let users simply scale their CPU demands to allow them to perform their desired analyses.  This would dramatically lower the barrier preventing people getting into this area and would reduce the cost for people not able to saturate a dedicated compute cluster with their own analyses.

Posted in Bioinformatics, Computing | Comments Off


First impressions of iOS4 on new iPod touch

Sunday, July 4th, 2010

My original first gen iPod touch has been, without doubt, the best gadget I’ve ever purchased.  It’s been used daily for almost 3 years now as an MP3 player, email client, TV, photo album, and PDA.  However the headphone socket has only played out of one ear for about 6 months, and about a month ago I lost a whole swathe of pixels at the bottom of the screen, so today I went out and bought a new one.

Initially I thought this was going to work out well.  I synced the new iPod with iTunes and it said it was going to upgrade to iOS 4 and restore the settings from my old iPod – great.  However I’m now collecting a list of problems which I thought I’d enumerate here.

  1. Couldn’t connect to my home wifi network.  The new iPod picked up the stored password from my settings, but couldn’t connect to my network.  All other machines (and my old iPod) worked fine.  I had a very similar problem where my Macbook Pro wouldn’t connect to my parents’ network after upgrading to Snow Leopard.  Research for that problem led me to several posts which linked the problem to the use of WEP on the connection.  I therefore changed the encryption to WPA and the new iPod now connects.  Is this a backdoor route through which Apple hope to increase the security of home networks?
  2. Couldn’t connect to my email.  Although my email settings were carried over, the password wasn’t.  Oddly I wasn’t prompted for it but rather told to go into the mail settings to find the appropriate box in which to add it.  Once I’d added the password I could connect.
  3. But – I’ve had some oddities with my email.  I use an SSL encrypted IMAP account and I’ve had messages which wouldn’t move to read status no matter how many times I viewed them.  I also had a ‘ghost message’ with no title and no content – but it was apparently unread and couldn’t be read or deleted.  It took a reboot to get rid of that, but it’s reoccurred after some more use.  It seems that I’m not alone but the immediate fix (use the multitasking switcher to kill mail) doesn’t work for me because this iPod doesn’t support multitasking.
  4. Although all my media seems to have synced over OK, one of my podcasts (This Week in Tech as it happens) is duplicated on my iPod.  The same episodes appear in both copies.  I’ve removed and readded the podcast from iTunes, but it just duplicates it all over again.  No idea what else to do with this.

This experience has further reinforced my love-hate relationship with Apple.  When their stuff works it’s great.  Well thought out, nicely implemented and consistently integrated.  Everything you could hope for really.  The problem comes when things don’t work out properly.  The ecosystem within which Apple operates is so closed that they won’t acknowledge when problems exist and you’re left with an unknown wait until a future update magically fixes things.  People contacting Apple support about the IMAP problems have been talked through resetting their mail accounts and a few other fixes which Apple must know don’t work.  Contrast this with open source solutions where all bug tracking is out in the open, so you can not only see that the problem exists and has been identified, but also get an official response on the best workaround and track the progress on a permanent fix.  Would adopting this kind of system really cause harm to Apple?

Posted in Computing, Technology | Comments Off


Copying OpenOffice formulas to a whole column

Friday, January 29th, 2010

In my entire working life the single biggest increase to my productivity came the day that I found out that double clicking on the little box at the bottom right of a cell containing a formula would copy that formula to the entire column.  Prior to that, on a bad day, I could spend half an hour just watching a cursor scroll down a very large spreadsheet over and over again.

These days I use OpenOffice for most of my work and this little trick doesn’t work in Calc.  Losing this one thing was enough to make me go back to Excel for my spreadsheet work.

However I’ve now found an equivalent function in Calc which has made me a whole load happier with OpenOffice.

The trick is as follows:

  1. Write the formula in the top cell of a column
  2. Copy this cell
  3. Select all of the rest of the cells in the column.  You can do this by clicking on the top cell and shift clicking on the bottom cell so the scrolling is really quick
  4. With all of the cells selected select Paste.  The formula will be copied to all of the currently selected cells, but will be adjusted according to its position in the sheet.

OK, so it’s not quite as quick as just double clicking the cell handle, but it’s pretty close and works well enough to allow me to keep using OpenOffice.

Posted in Computing | Comments Off


Overestimating DWIM in Perl

Saturday, October 17th, 2009

I hit a bug in a script I was writing this week which reminded me that sometimes you can put too much faith in perl’s ability to ‘do what I mean’ (DWIM). It took me a couple of minutes to see what was going wrong here – see if you do any better.

#!/usr/bin/perl
use warnings;
use strict;

my $data;
populate_data($data);
print_data($data);

sub populate_data {
  my ($d) = @_;

  $d->{somekey} = 'somevalue';
}

sub print_data {

  my ($d) = @_;

  if (exists $d->{somekey}) {
    print 'Win!';
  }
  else {
    die 'Fail!';
  }
}

No errors, no warnings, but epic fail.

One of the things which confuses people learning perl is the creation of complex data structures and autovivification.  That is to say if you treat a scalar as a reference to an array of array of array of arrays (for example), then that’s exactly what it becomes – you don’t need to explicitly define the structure as it is all created on the fly.

In this case though I was expecting too much.  When the $data variable is first created it has an undefined value.  When I pass it into the subroutine all that is passed is the undefined value.  Adding data to the undefined value autovivifies the expected data structure, but since this was created in the subroutine there’s no way for that to propagate back to the calling code.

The answer is to define at least the top level of structure before calling the populate_data subroutine.  That way you are passing in a valid reference to the subroutine (where further structure can be added), but the data is added to an anonymous data structure which is still referred to in the calling code.

The simple fix is therefore to make the initial declaration of $data be:

my $data = {};

..and Fail becomes Win.

Posted in Computing | Comments Off


Sound Level Monitoring in Java

Tuesday, October 6th, 2009

For a project I’ve been working on I needed what I thought should be fairly easy to make – a simple widget to monitor an input sound level.  I’d never worked with the full javax.sound API before, but assumed that there would be ample documentation to do what I needed.

Having started on the project things turned out to be a bit more tricky than I thought.  Most of the examples I found revolve around passing sound from an input to an output, and the sound API makes this fairly easy.  What proved to be more tricky was the intercepting and processing of a sound input.  I’ll therefore go through what I did to make this work in the end.

Let’s start with the basics:

What you need at the end of the day is a TargetDataLine object.  The nomenclature of the javax.sound API is somewhat unusual in that a SourceDataLine is a sound output line (such as a set of speakers), whereas a TargetDataLine is a sound input line (such as a microphone).

You have two choices for getting a TargetDataLine.  You can go through a rigorous process of exploring the capabilities of all of the sound lines on your system and find some way to select the best match to what you want.  Alternatively you can decide on the type of line you want and request that from the sound system.  If this type of line is not available then you will get a LineUnavailableException.

Having tried both approaches I ended up plumping for the second option.  Since I didn’t need high precision and wasn’t asking anything difficult of my sound source I could select conservative properties for my line and trust that nearly all sound cards would be able to support this.  In practice this code has worked on every machine I’ve tried it on, but conceivably it could fail on some sound cards.

You need two things to get a TargetDataLine: the AudioSystem class and an AudioFormat object.  The AudioSystem class provides a series of static methods through which you can access the components of the audio system on your machine.

An AudioFormat object defines the characteristics of the sound stream you want to obtain.  There are a few different parameters you must set and there are a range of acceptable values which will be supported by the majority of sound cards:

  • Sample rate: This says how many times per second the sound will be sampled
  • Sample size: The number of bits in each sample taken
  • Number of channels: Whether this input is mono or stereo
  • Signed: Whether the sound samples are signed or unsigned
  • Endianness: For multi-byte sample sizes says whether the byte order is big or little endian

Since I was only interested in monitoring a line level I chose fairly conservative values:

  • Sample Rate: 8000.  This is about the lowest commonly used sample rate.  CD quality sound is 44.1kHz, but this is overkill for a simple sound monitor
  • Sample Size: 16 (bits).  You can also use 8 bit samples depending on the resolution you are after.
  • Channels: 1 My interest was in overall sound level so mono sound was OK
  • Signed: true.  This is the more common option, and is easier to deal with in java since all java primitives are signed.
  • Endianness: Since my samples are 16 bit then byte order matters.  Most common audio formats (eg WAV) are little endian as is the x86 architecture, so this is the common choice.

In practice this means that to get a TargetDataLine I can do something like this:

 // Request an input line matching the format below.  This throws a
 // LineUnavailableException if no suitable line is available.
 AudioFormat audioFormat = getAudioFormat();
 targetDataLine = AudioSystem.getTargetDataLine(audioFormat);

 private AudioFormat getAudioFormat(){
   float sampleRate = 8000.0F;     // samples per second
   int sampleSizeInBits = 16;
   int channels = 1;               // mono
   boolean signed = true;
   boolean bigEndian = false;      // little endian byte order
   return new AudioFormat(sampleRate,sampleSizeInBits,channels,signed,bigEndian);
 }

Once I have my TargetDataLine I then need to activate it before I can read any data from it.  Activating the line is as simple as:

 targetDataLine.open();
 targetDataLine.start();

Once the line is open you can begin to read data from it.  The usual way to do this is in a separate thread where you have a buffer which you fill with data from the line and then process.  A read from a line will block until the buffer is full, so the size of your buffer will determine how often you process the data.

The amount of data produced will be a function of your sample rate, sample size and number of channels.  If you have one channel and 16 bit samples at 8kHz then every second you will produce 8000 * 16 * 1 = 128000 bits of data, or 16000 bytes.  Therefore a 16000 byte buffer will be filled every second, and an 8000 byte buffer will fill in half a second.
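The arithmetic above can be sketched in a few lines of Java.  This is only an illustration: the class and method names (BufferTiming, bytesPerSecond, secondsToFill) are made up for this example and are not part of the javax.sound API, and it assumes the sample size is a whole number of bytes.

```java
public final class BufferTiming {

    private BufferTiming() {}

    // Bytes produced per second for a given format.
    // Assumes sampleSizeInBits is a multiple of 8 (8 or 16 in this post).
    public static int bytesPerSecond(int sampleRate, int sampleSizeInBits, int channels) {
        return sampleRate * (sampleSizeInBits / 8) * channels;
    }

    // How long a blocking read takes to fill a buffer of the given size.
    public static double secondsToFill(int bufferBytes, int sampleRate,
                                       int sampleSizeInBits, int channels) {
        return (double) bufferBytes / bytesPerSecond(sampleRate, sampleSizeInBits, channels);
    }

    public static void main(String[] args) {
        // 8kHz, 16 bit, mono: the format used in this post
        System.out.println(bytesPerSecond(8000, 16, 1));      // prints 16000
        System.out.println(secondsToFill(2000, 8000, 16, 1)); // prints 0.125
    }
}
```

So a 2000 byte buffer with this format would be processed eight times a second, which is a reasonable update rate for a level meter.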

Reading the input can therefore be done in a loop as follows:

byte [] buffer = new byte[2000];
while (true) {
  int bytesRead = targetDataLine.read(buffer,0,buffer.length);
}

The remaining problem is then how to process the filled buffer to get the overall level.  There are a few options for this, you could work out the average sound level over the whole sample, or you could work out the peak level throughout the sample.  I chose the latter method, but the basic process would be the same in either case.

Since my samples were 16bit I had to take into account that each sample was composed of two bytes, and I needed to recombine these into a signed short before processing them.  This involves using a bit shifting operation to combine the two bytes, and taking into account the little endian byte order.

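short max;
if (bytesRead >= 2) {
 // Little endian: low byte first.  The low byte is masked with 0xFF so
 // that a negative byte value isn't sign extended into the high byte.
 max = (short) ((buffer[0] & 0xFF) | (buffer[1] << 8));
 for (int p=2;p<bytesRead-1;p+=2) {
   short thisValue = (short) ((buffer[p] & 0xFF) | (buffer[p+1] << 8));
   if (thisValue>max) max=thisValue;
 }
 System.out.println("Max value is "+max);
}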

For an 8 bit sample I could use the individual bytes directly.  For a big endian sample the p and p+1 positions in the generation of the short would have to be reversed.
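As a rough sketch of how the little endian, big endian and 8 bit cases differ, the decoding could be pulled out into helpers like the ones below.  The names (SampleDecoder, toSample, peak16, peak8) are hypothetical, and unlike the snippet above this version tracks the largest absolute value rather than the largest signed value, which is an assumption about what a peak meter would want.

```java
public final class SampleDecoder {

    private SampleDecoder() {}

    // Combine two bytes into one signed 16 bit sample.
    // For little endian data the first byte is the low byte,
    // for big endian data the first byte is the high byte.
    public static short toSample(byte first, byte second, boolean bigEndian) {
        byte high = bigEndian ? first : second;
        byte low = bigEndian ? second : first;
        // Mask the low byte so it is not sign extended
        return (short) ((high << 8) | (low & 0xFF));
    }

    // Largest absolute sample value in a buffer of signed 16 bit samples.
    public static int peak16(byte[] buffer, int bytesRead, boolean bigEndian) {
        int peak = 0;
        for (int p = 0; p + 1 < bytesRead; p += 2) {
            int magnitude = Math.abs((int) toSample(buffer[p], buffer[p + 1], bigEndian));
            if (magnitude > peak) peak = magnitude;
        }
        return peak;
    }

    // Largest absolute sample value in a buffer of signed 8 bit samples,
    // where each byte can be used directly.
    public static int peak8(byte[] buffer, int bytesRead) {
        int peak = 0;
        for (int p = 0; p < bytesRead; p++) {
            int magnitude = Math.abs((int) buffer[p]);
            if (magnitude > peak) peak = magnitude;
        }
        return peak;
    }
}
```

Returning an int rather than a short means the most negative sample (-32768) can be reported as a magnitude of 32768 without overflowing.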

Using this method I can now sample the input line at any rate I choose to get the raw data from which a peak meter could be produced.  The final thing to remember is that our perception of sound should always be viewed on a log scale.  If you are creating a sound meter you should therefore log transform the data before plotting to gain a more realistic view of the sound level.
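One common way to do the log transform is to express the peak in decibels relative to full scale.  The sketch below assumes signed 16 bit samples (so full scale is 32768) and clamps silence to an arbitrary floor of -96dB; the class name LevelMeter and the choice of floor are illustrative only.

```java
public final class LevelMeter {

    private static final double FULL_SCALE = 32768.0; // largest 16 bit magnitude
    private static final double FLOOR_DB = -96.0;     // arbitrary floor for silence

    private LevelMeter() {}

    // Convert a peak sample value to decibels relative to full scale.
    // 0dB is the loudest possible signal, more negative values are quieter.
    public static double toDecibels(int peak) {
        int magnitude = Math.abs(peak);
        if (magnitude == 0) {
            return FLOOR_DB; // log10(0) is minus infinity, so clamp it
        }
        double db = 20.0 * Math.log10(magnitude / FULL_SCALE);
        return Math.max(db, FLOOR_DB);
    }

    public static void main(String[] args) {
        // Halving the amplitude drops the level by roughly 6dB
        System.out.println(toDecibels(32768));
        System.out.println(toDecibels(16384));
        System.out.println(toDecibels(8192));
    }
}
```

A meter could then map the range from the floor up to 0dB linearly onto its display.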

Posted in Computing | Comments Off


Managing Really Large Data Sets

Thursday, August 27th, 2009

For a while now I’ve been working with next generation sequencing datasets.  Each dataset consists of around 10 million mapped genome positions, and an experiment can consist of 10 or more datasets.

When analysing this data memory usage is a major issue.  Up until now our approach has been to try to store everything in memory as efficiently as possible but even using machines with 12+GB of memory we’re hitting scaling limits.

To try to alleviate this I’ve been investigating using disk based storage rather than putting the data in RAM.  I’d avoided this before as I wanted the application to be as responsive as possible and didn’t want the overhead of the extra IO.

I’ve now taken the plunge and started experimenting with disk based storage, and my initial results are encouraging.

I chose to go with SQLite using the SQLiteJDBC connector.  My initial attempt was pretty basic using only a single table as a sort of disk based Hashtable and I hadn’t even got round to adding indices to speed up searches.  Despite this the speed of data access was so quick I couldn’t tell the difference from the in-memory storage, and the memory usage dropped significantly.

Since this worked out so successfully I’m now considering doing a major rewrite to put all of the application data into a single large SQLite database and just caching objects for the currently visible chromosome in memory (to allow quick redraws).  This is going to require some major refactoring, but it should allow the program to scale to much larger datasets and should make opening a project a virtually instantaneous operation.

Posted in Bioinformatics, Computing | Comments Off


Optimising Java Memory Usage

Sunday, June 21st, 2009

In the application I’m working on I have to deal with large amounts of data, which means handling tens of millions of java objects.  One of the biggest problems we face is keeping the memory usage of the program under control.

The main object which accounts for the vast majority of the memory consumption contains only a few fields.  These are:

  • A reference to a different object
  • An int to specify a start
  • An int to specify an end
  • An int to act as a flag (which only has 3 values)

To try to improve the memory profile of this object I tried changing the flag value to use a byte instead of an int, and also storing a start and a length (as a short) instead of an end value (as an int).

Making these changes I was surprised to find that the memory usage of the application didn’t change at all.  I didn’t see how this could be, since I was using smaller data structures so I went looking for details of how java allocates memory.  I was surprised at the results!

Basically, on the JVM I was using, most class level variables appeared to consume the same amount of memory (4 bytes), except for longs and doubles which consume 8 bytes.  This means that it didn’t matter whether a variable was a boolean or an int, the object still used the same amount of memory.  (The exact layout is JVM specific, and object sizes are typically rounded up to a multiple of 8 bytes, which can hide small per-field savings.)  If you put a series of variables into an array then there is an overhead for using the array in the first place, but the individual members of that array are packed efficiently such that an array of bytes would take less memory than an array of ints of the same length.

In our case I tried out a few different options for combining the 3 ints at the core of our object to try to save memory.  I took the standard 3 int version of the class as the basis for comparison.

  • As stated above moving to an int, a short and a byte used exactly the same amount of memory, and slowed the application down slightly as more calculation had to be done to extract the end value each time.
  • Using a single long and bitshifting to fit an int, short and byte into it did save memory, but is not as efficient as I’d hoped since there is still one unused byte in the long – and I’m only using 3 possible values out of the range which I could encode in the byte.
  • Using a byte array to store 7 bytes (not wasting the extra byte used in the long) was much less efficient than any of the other solutions.

We’ve therefore gone down the route of packing data into a single long, which is saving ~20% of the previous memory usage, at the expense of some overhead during the packing/unpacking.  The packing overhead is pretty minimal though, adding only 2 seconds on a test which took around a minute in the original code.

Doing the packing/unpacking is fairly straightforward using the bitshifting operators in java.  In the following excerpt packedPosition is a long, start is an int, length is a short and strand is a byte.

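// Bits 0-31 hold start, bits 32-47 hold length, bits 48-55 hold strand.
// The masks stop negative values sign extending into the other fields.
packedPosition = start & 0xFFFFFFFFL;
packedPosition |= (length & 0xFFFFL) << 32;
packedPosition |= (strand & 0xFFL) << 48;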

Getting the data back out is also fairly simple.  For example, extracting the length would be achieved by doing:

(short)(packedPosition>>32);
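Putting the packing and unpacking together, a self-contained sketch might look like the following.  The class name PackedPosition and its helper methods are invented for this example; the bit layout (start in bits 0-31, length in bits 32-47, strand in bits 48-55) follows the excerpt above.

```java
public final class PackedPosition {

    private PackedPosition() {}

    // Pack an int, a short and a byte into a single long.
    // Each value is masked first so negative values can't
    // sign extend over the neighbouring fields.
    public static long pack(int start, short length, byte strand) {
        return (start & 0xFFFFFFFFL)
             | ((length & 0xFFFFL) << 32)
             | ((strand & 0xFFL) << 48);
    }

    // The casts simply discard the bits above each field.
    public static int start(long packed) {
        return (int) packed;
    }

    public static short length(long packed) {
        return (short) (packed >> 32);
    }

    public static byte strand(long packed) {
        return (byte) (packed >> 48);
    }

    public static void main(String[] args) {
        long packed = pack(1234567, (short) 50, (byte) 2);
        // prints "1234567 50 2"
        System.out.println(start(packed) + " " + length(packed) + " " + strand(packed));
    }
}
```

Since the long still has one spare byte at the top, a further small field could be added later without changing the size of the object.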

We’re now looking for other cases where adding this extra complexity is worth the memory savings we can achieve.


                

Posted in Computing | Comments Off


Exporting SVG from Java

Sunday, February 8th, 2009

Several programs I’ve written in Java have had an image export component to them.  Up until now the export has always only been as a bitmap image.  This is very easy to do using a BufferedImage and an ImageWriter.  Given a Component (which covers all AWT or Swing widgets) and a file you can create a PNG very simply:

  BufferedImage b = new BufferedImage(c.getWidth(),c.getHeight(),BufferedImage.TYPE_INT_RGB);
  Graphics g = b.getGraphics();
  c.paint(g);
  ImageIO.write((BufferedImage)(b),"PNG",file);

I’ve always thought though that it should be possible to create an SVG file from an arbitrary component, but never got round to trying it.  However I got snowed in last week and decided to give it a go, and it turned out to be somewhat easier than I thought.

There is a natural fit between the abstract methods of the AWT Graphics class and the primitive components in the SVG spec – they even use the same coordinate system.  Creating an SVG export class therefore simply requires extending the Graphics class and translating its method calls to SVG code.

Although the basic premise is fairly straightforward – there were a few gotchas which had me scratching my head!

  1. The Graphics class has a create() method which returns a Graphics object.  Initially I was just returning the same object each time and this seemed to work.  With more complex nested objects though the coordinates were getting messed up.  I worked out that it was important that each Graphics instance keeps track of its own translate coordinates as these are managed separately in each instance.
  2. Translate coordinates are relative and not absolute.  When a new set of coordinates are passed via the translate() method these must be added to the current translation, and don’t replace it.
  3. Fonts are problematic.  Size is fine, but font names are not trivially converted between the java font name and something SVG can understand.  It may be possible to figure out some generic rules or use a translation table, but since my applications all use a default sans-serif font of varying sizes I’ve hardcoded this for now.
  4. Some methods, such as font metric creation are a bit of a pain to write – so I’ve cheated and created an unused BufferedImage Graphics object within my class of the appropriate size and pass any difficult method calls to that – I can then discard it at the end.  This is slightly wasteful of memory, but is a quick fix.
  5. I coded my initial implementation on a Mac and it was all working.  I then tried it on Windows and Linux and got empty SVG files.  I resorted to a question on StackOverflow to figure out what was going on, and it turned out that the double buffering mechanism was causing the problems.  If this was enabled then all I saw in my Graphics class was a single call to drawImage() with a complete bitmap in it.  This was passed directly from the offscreen buffer, but meant I couldn’t intercept the individual shape method calls.  I therefore now use the RepaintManager to disable double buffering on the component before calling its paint method, which ensures nothing comes between my class and the Graphics method calls.
  6. Lots of code seems to assume that the object which gets passed to paint always implements Graphics2D.  I put a conditional check into my code for this, but it might be a good idea to implement (with null methods) the extra Graphics2D methods if this was to be used generically.
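The first two gotchas (each instance keeping its own translation, and translations being relative) can be illustrated with a cut down sketch.  This is not the SeqMonk class: SvgSketch is a hypothetical stand-in which does not extend Graphics and only handles lines, but its create() and translate() methods follow the same rules as the ones described above.

```java
public final class SvgSketch {

    private final StringBuilder out; // shared by a parent and all its children
    private int offsetX;             // this instance's own translation
    private int offsetY;

    public SvgSketch() {
        this(new StringBuilder(), 0, 0);
    }

    private SvgSketch(StringBuilder out, int offsetX, int offsetY) {
        this.out = out;
        this.offsetX = offsetX;
        this.offsetY = offsetY;
    }

    // Like Graphics.create(): the new instance starts from the current
    // translation but then tracks its own, independently of the parent.
    public SvgSketch create() {
        return new SvgSketch(out, offsetX, offsetY);
    }

    // Like Graphics.translate(): the offsets are relative, so they are
    // added to the current translation rather than replacing it.
    public void translate(int dx, int dy) {
        offsetX += dx;
        offsetY += dy;
    }

    public void drawLine(int x1, int y1, int x2, int y2) {
        out.append("<line x1=\"").append(x1 + offsetX)
           .append("\" y1=\"").append(y1 + offsetY)
           .append("\" x2=\"").append(x2 + offsetX)
           .append("\" y2=\"").append(y2 + offsetY)
           .append("\" stroke=\"black\"/>\n");
    }

    public String toSvg(int width, int height) {
        return "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"" + width
             + "\" height=\"" + height + "\">\n" + out + "</svg>\n";
    }
}
```

Translating a child created from a parent leaves the parent’s own offsets untouched, which is what keeps nested components drawing in the right place.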

At the end of all of this I now have a class which implements a single static method of:

public static String convertToSVG (Component c);

This has worked in every instance I’ve tried.  There are some methods I’ve not implemented – namely the drawImage methods and some of the drawPolygon methods.  The polygon methods should be easy enough.  The drawImage methods may be more problematic, but I don’t feel too bad leaving these out since there’s not much advantage to using SVG if you’re just going to stick a bitmap into it.

The SVG code will be in the 0.3 release of SeqMonk and I may release it as a stand alone package if anyone shows any interest in it.


                

Posted in Computing | Comments Off


Broadcom problems under linux

Thursday, November 20th, 2008

I’ve been trying to install Fedora 9 on a Dell PowerEdge T100 and have been tearing my hair out trying to get the network card to work.

The card in question is a Broadcom Netextreme BCM5722 Gigabit Ethernet PCI Express adapter.  I’ve used similar Broadcom cards in other machines without any problems, but this time around it didn’t want to play nicely.

Since I was doing an NFS install I couldn’t really do anything until this worked, but trying either a DHCP or a static configuration failed.

From a static configuration the error I got was:

result of pumpSetupInterface is pumpSetupInterface failed: create route - 1:Operation not permitted

From a DHCP configuration it was:

DHCPv4 eth0 - TIMED OUT

All the other messages about the card seemed OK.  The tg3 driver recognised it and even got as far as setting up the link and declaring it ready:

tg3: eth0: Link is up at 100Mbps, half duplex
tg3: eth0: Flow control is off for TX and off for RX
ADDRCONF (NETDEV_CHANGE) eth0: link becomes ready

After much tinkering I found that if I booted the machine with the ethernet cable disconnected and then only connected it once the card was trying to obtain an address everything suddenly started working.  If the cable was connected from the start then the interface would never come up.  This provided me with enough of a workaround to get Fedora 9 installed and updated.

Fortunately it seems the problem with this card and the tg3 driver has been fixed between the initial Fedora 9 release kernel and the current update kernel (2.6.27.5-37.fc9.x86_64), such that after updating I was able to reboot without having to disconnect the ethernet cable and everything still worked.

Posted in Computing | Comments Off


Troubleshooting IMAP SSL

Wednesday, November 5th, 2008

I’ve just spent a while trying to troubleshoot my SSL IMAP connection.  This is the first time I’ve had to do any diagnostics since switching to an SSL secured mail connection.

When my connection stopped working I got only a very non-specific error from OSX mail, and no error at all from Thunderbird (it just hung).  If I was using an unsecured connection I’d usually try to check the connection manually using telnet, but trying this against the SSL port on my IMAP server didn’t get any response.

Having done some digging I found that you can test an SSL secured connection using the tools included with openssl.  In the case of IMAP you can connect to the server using:

openssl s_client -connect mail.example.com:993

In my case this failed (hence things not working!), with the error:

CONNECTED(00000003)
write:errno=54

Reading through the openssl documentation I found that this error usually results from the connection not being able to auto-negotiate a suitable ssl version to use.  If this is the case you can force a specific ssl version using:

openssl s_client -connect mail.example.com:993 -ssl2

or

openssl s_client -connect mail.example.com:993 -ssl3

If you want more information you can also add -debug to the command to see a full list of the commands being sent and a hex dump translation.

In my case I found that the connection only worked when sslv3 was forced; forcing sslv2 or allowing the connection to autonegotiate caused the connection to fail.  Since none of the mail clients I could find allow you to force a specific ssl version, my email wouldn’t work.

Fortunately my hosting provider Orchard Hosting were very quick to respond when I reported this and have fixed things.

Posted in Computing | Comments Off