The one analysis module which seems to elicit more questions than any other is the duplicate sequence plot. Of all of the plots which the program generates it’s probably the one which causes the most warnings / errors in otherwise nice looking data. I’m happy to admit that it’s not always immediately obvious what the plot is telling you, but it does contain some really useful information, so it’s worth getting to grips with it!
So what does the duplicate sequence plot tell you? As with all of the plots in FastQC it’s not there to tell you if your data is good or bad, it’s there to tell you if your data looks unusual in some way. More specifically the point of the duplicate sequences plot is to tell you to what extent you are wasting the sequencing capacity you have paid for by simply resequencing the exact same sequences over and over again.
The point of the plot is that in a diverse random library, even at relatively high coverage, you shouldn’t resequence the exact same region multiple times. If across your entire library you are sequencing each fragment multiple times then in most cases you stop learning anything new about your library. You could effectively get the same amount of information by doing less sequencing, which means either that you would benefit from a more diverse library, or that you could sequence the existing one less deeply (eg by mixing the library with others in a multiplexed experiment).
One of the slightly confusing aspects of the plot is the scale used for the y-axis, but this scale is a result of compromises made to allow FastQC to run in a reasonable amount of memory. In an ideal world we would analyse every sequence in a library and count how many times each one occurred. If we did that we could put absolute numbers on the plot and this would make it easier to interpret. However, if we had a perfectly diverse library with no duplicated sequences this would mean holding the whole library in memory at once which isn’t practical on many machines. What happens instead is that the analysis occurs only for the first 200,000 different sequences seen. The number of occurrences of these sequences is then tracked through the rest of the file, but any new sequences after the first 200,000 are then discarded. Because this precludes the possibility of generating absolute numbers we plot the results on a relative scale, with the number of sequences occurring exactly once being set as 100%, and everything else being shown relative to that number. The plot value for duplicate 1 is therefore always 100%, and higher duplication levels can produce values either higher or lower than this.
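The counting scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not FastQC’s actual code: the function name `duplication_profile` and its parameters are made up for this example, and it assumes that each bin counts distinct sequences (rather than total reads) at that duplication level.

```python
from typing import Iterable, List


def duplication_profile(
    reads: Iterable[str],
    max_tracked: int = 200_000,  # tracking limit quoted in the text
    top_bin: int = 10,           # levels of 10 or more share one "10+" bin
) -> List[float]:
    """Return plot values for duplication levels 1..top_bin, as % of level 1."""
    counts = {}
    for seq in reads:
        if seq in counts:
            counts[seq] += 1          # already tracked: keep counting it
        elif len(counts) < max_tracked:
            counts[seq] = 1           # still room: start tracking it
        # otherwise: a new sequence seen after the limit is discarded

    # Assumption: each bin counts distinct sequences, not total reads.
    levels = [0] * top_bin
    for n in counts.values():
        levels[min(n, top_bin) - 1] += 1

    unique = levels[0]
    if unique == 0:
        raise ValueError("no sequence occurs exactly once; relative scale undefined")
    return [100.0 * v / unique for v in levels]


# Tiny demonstration with a tracking limit of 2: "C" arrives too late to be tracked.
profile = duplication_profile(["A", "A", "B", "C", "A"], max_tracked=2)
print(profile)
```

In the demonstration, "B" is the only sequence seen once, so level 1 is 100%, and "A" (seen three times) puts level 3 at 100% as well. "C" never enters the table because two sequences were already being tracked when it first appeared.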
In an ideal plot for a diverse library the values for duplicate levels above 1 should quickly decay to zero and stay there. However there are several ways in which the plot can indicate other types of problem in the library.
How things can go wrong
There are a number of different potential modes of failure which show up in the duplicate sequences plot. Here we try to provide examples of some of the most common ones and explain how they might occur.
A small number of sequences dominate an otherwise diverse library
In some cases you may find that a very small number of sequences end up comprising a large proportion of your library. The most common example of this is contamination by adapter sequence but any small source of contamination could conceivably cause a similar effect.
In this case the shape of the duplicate plot would look normal, with the unique sequences being the highest value and the rest of the plot quickly decaying to values close to zero. However, the overall duplicate percentage shown at the top of the plot will be very high. The reason the plot looks normal is that its final bin is 10+, which collects every sequence duplicated ten or more times, so a very small number of highly duplicated sequences won’t cause the plot to peak.
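A toy calculation shows how the two numbers can disagree. The library below is invented for illustration, and the sketch assumes that the plot counts distinct sequences per level and that the overall duplicate percentage is the share of reads which repeat an already-seen sequence. Neither definition is taken from the FastQC source.

```python
from collections import Counter

# Made-up library: 1,000 fragments seen once each,
# plus a single contaminant sequence seen 9,000 times.
reads = [f"fragment_{i}" for i in range(1000)] + ["ADAPTER"] * 9000

counts = Counter(reads)

# Distinct sequences per duplication level; index 9 is the 10+ bin.
levels = [0] * 10
for n in counts.values():
    levels[min(n, 10) - 1] += 1

# Plot values relative to the number of sequences seen exactly once.
relative = [100.0 * v / levels[0] for v in levels]

# Assumed definition: reads repeating an already-seen sequence / all reads.
overall_duplicate_pct = 100.0 * (len(reads) - len(counts)) / len(reads)

print(relative)
print(overall_duplicate_pct)
```

The contaminant is only one distinct sequence out of 1,001, so it adds just 0.1% to the 10+ bin, yet it accounts for almost all of the roughly 90% overall duplication.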
The easiest way to confirm this type of problem is to look at the overrepresented sequences module results which should list the overrepresented sequences along with the proportion of the library they comprise.
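If you want to run the same kind of check by hand on a small set of reads, something like the following would do. It is a rough sketch rather than a reimplementation of the overrepresented sequences module: the function name and the `min_fraction` cutoff are hypothetical and are not FastQC’s own threshold.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def overrepresented(
    reads: Iterable[str],
    min_fraction: float = 0.05,  # hypothetical cutoff chosen for this example
) -> List[Tuple[str, int, float]]:
    """List (sequence, count, percent of library), most frequent first."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, 100.0 * n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Made-up library: half the reads are one contaminant, the rest are all different.
reads = ["ADAPTER"] * 50 + [f"fragment_{i}" for i in range(50)]
hits = overrepresented(reads)
print(hits)
```

Here only the contaminant passes the cutoff, at 50% of the library, while each of the other sequences makes up just 1%.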
Every sequence in a library occurs a large number of times
If a library with very limited diversity is subjected to high throughput sequencing then it’s likely that every sequence in the library will be seen many times.
In this case the plot trace will be low through all duplication levels until the 10+ bin, where the plot will spike sharply, possibly reaching many hundreds of percent of the unique sequence count. The overall duplicate level will probably also be high (>90%).
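A small simulation makes the contrast with a diverse library concrete. It is a sketch under simple assumptions: reads are drawn uniformly at random from a fixed pool of fragments, the library sizes are invented, and the overall figure is taken to be the share of reads which repeat an already-seen fragment. It reports raw counts per level rather than percentages, because in the low diversity case there may be no unique sequences to scale against.

```python
import random
from collections import Counter


def simulate(n_fragments: int, n_reads: int, seed: int = 0):
    """Draw n_reads at random from a library of n_fragments distinct fragments."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n_fragments) for _ in range(n_reads))

    # Distinct fragments per duplication level; index 9 is the 10+ bin.
    levels = [0] * 10
    for n in counts.values():
        levels[min(n, 10) - 1] += 1

    # Assumed definition of the overall figure: repeat reads / all reads.
    overall_duplicate_pct = 100.0 * (n_reads - len(counts)) / n_reads
    return levels, overall_duplicate_pct


# Same sequencing depth, very different library diversity.
diverse_levels, diverse_pct = simulate(n_fragments=1_000_000, n_reads=5_000)
limited_levels, limited_pct = simulate(n_fragments=100, n_reads=5_000)

print("diverse:", diverse_levels, diverse_pct)
print("limited:", limited_levels, limited_pct)
```

With only 100 fragments to draw from, 5,000 reads cannot contain more than 100 distinct sequences, so at least 98% of reads are repeats and the 10+ bin holds most of the fragments. The diverse library at the same depth stays almost entirely in the first bin.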
The root cause of this type of result is low diversity in the original library. This could be because you are hugely over-sequencing a diverse library (a PhiX control lane, for example, would show a result like this), or it could be that the library diversity is unexpectedly low. The most common cause of unexpected loss of diversity is PCR over-amplification, where a small subset of the full library is artificially expanded through the use of too many cycles of amplification. Results like this can also come from experiment types which are expected to generate limited diversity, such as those which are based around restriction sites rather than random fragmentation.
A low level of duplication is seen across a whole library
In some cases you will see a less extreme plot where the rate at which the duplicate plot falls from unique sequences is slow – showing appreciable proportions of the library with duplication levels of 4-5, and it may show a small spike in the 10+ bin.
In some cases this will simply be a less extreme version of the last situation where every sequence in the library is subjected to low level duplication, and natural variation spreads this across a somewhat broad range.
In other cases though you might be dealing with a library which has highly variable levels of duplication – and this may have a biological rather than a technical cause. The most common type of library to produce this type of plot is an RNA-Seq library. In this type of library it is expected that some sequences will occur very frequently, and others will be very rare. If you want to see the very rare sequences (eg low copy number transcripts), then you will have to greatly over-sequence the most frequent sequences (eg housekeeping genes), so a high level of duplication in part of the library is unavoidable.
A final warning
One final thing to bear in mind when looking at duplicate sequence plots is that they are based on an exact match between sequences. Any sequencing errors in the library will tend to create artificial diversity, making identical sequences look different because of technical errors. To mitigate this effect in long reads, only the first 50bp of each sequence are used to assess the duplication level. However, if the sequence quality in a library is very poor then the duplication plot for a heavily duplicated library might be made to look perfectly normal by the introduced errors. You should therefore always consider the results of the sequence quality plots (especially the per-base quality plot) alongside the duplication plot to gain a realistic assessment of the true level of duplication.
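The effect of errors on exact matching is easy to reproduce with a toy example. The sketch below compares reads on their first 50bp only, as described above, and then corrupts a fully duplicated library with random substitutions. The 5% per-base error rate, the helper names and the definition of the duplicate percentage are all assumptions made for illustration.

```python
import random


def duplicate_pct(reads, prefix_len=50):
    """Percent of reads whose first prefix_len bases repeat an earlier read."""
    keys = [r[:prefix_len] for r in reads]
    if not keys:
        return 0.0
    return 100.0 * (len(keys) - len(set(keys))) / len(keys)


def add_errors(read, error_rate, rng):
    """Substitute each base with a different one with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(1)
template = "".join(rng.choice("ACGT") for _ in range(100))

# A library made of 1,000 copies of one fragment, read with and without errors.
clean = [template] * 1000
noisy = [add_errors(template, 0.05, rng) for _ in range(1000)]  # illustrative rate

clean_pct = duplicate_pct(clean)
noisy_pct = duplicate_pct(noisy)
print(clean_pct, noisy_pct)

# Differences beyond the first 50bp do not affect the comparison.
tail_only = duplicate_pct(["A" * 50 + "C", "A" * 50 + "G"])
print(tail_only)
```

Both libraries come from the same single fragment, but the noisy one reports far less duplication because most reads pick up at least one error in their first 50 bases and so no longer match exactly.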