Excited by bacterial DNA in sweet potatoes? Check this out!

Cool paper – turns out that the genome of the sweet potato has segments of the Agrobacterium genome in it! Genuinely cool stuff.

However, I found something even more exciting!  A tiny virus genome present in huge numbers of sequenced bacteria.  Check it out:

  1. Go here: http://blast.ncbi.nlm.nih.gov/Blast.cgi
  2. Click on Nucleotide Blast
  3. Where it says “Enter accession number, gi or FASTA”, simply enter the number: 9626372
  4. Where it says “Database” make sure you choose “others”
  5. Just under there, where it says “Organism (optional)” start typing the word Bacteria, and choose “Bacteria (taxid: 2)”
  6. Click the Blast button at the bottom

OMG!  This tiny viral genome is in E. coli, it’s in Acinetobacter, Desulfitobacterium, Sphingobacterium….

I’M EMBARRASSED FOR THEM! #SHAME

On publishing software

This post is a response to Titus and Daniel’s blog posts on whether published software should be re-useable, and covers some elements of an argument I had with Aylwyn on Twitter.

What is the purpose of an academic paper?

It’s a good question.  Reading up on the history of academic publishing, it seems that initially academic papers were little more than stories, with “trade secrets” kept by the authors and not revealed.  This did not work very well, and (this is unreferenced!) I like this sentence from the Wikipedia page on the subject: “The Royal Society was steadfast in its not yet popular belief that science could only move forward through a transparent and open exchange of ideas backed by experimental evidence”.

Adding some of my own thoughts, when describing methods in a paper: is the point of a publication to show that you can do something?  Or is the point of a publication to show everyone else how to do it?

And what do your funders want?  I think I can guess.  Most government funding bodies want to drive the knowledge economy because it stimulates job creation and economic growth.  It also contributes to general health, happiness and well-being.  Sometimes this is best achieved by protecting intellectual property and commercialising discovery; however, most often it is best achieved through the sharing of ideas, knowledge and expertise – the academic publication route.

To put it bluntly – any data and techniques which you choose to keep exclusively for your own group’s use and advantage act as a barrier to discovery, because no matter how good you are with them, you plus hundreds of others would achieve far more.

So – the point of an academic paper is to share ideas, knowledge and expertise.  It’s not an exercise in showing off (“look what I can do!”).

Software: What is the point of your paper?

Titus mentions single-use software and I have some sympathy here.  If the point of your paper is not the software, then documenting, testing, writing tutorials, releasing and supporting it can be very onerous and is somewhat of a distraction.  For example, if your paper is about a genome and how members of that species evolved, but to assemble that genome you wrote some scripts that hacked a few different assembly packages together, then you should definitely (i) release the scripts and (ii) describe the method in some detail; but I can see that this is single-use software, not meant for re-use, and it doesn’t need to have all the bells and whistles attached.  So I think Titus and I agree on that one.  Release it, mark it up as “not for re-use” and move on.

Publishing a theory paper

This is a grey area.  I’d like to draw an analogy to laboratory methods here.  Let’s say some bright spark has an idea.  Maybe they look at CRISPR and Cas and they think “hey theoretically, that could be used to edit genomes”.  They write up their theory and publish it.  Fine.  It’s a theory, but without an implementation, pretty low impact.  It’s all a bit Elon Musk.  Then imagine the bright spark instead shows that they can edit a genome using CRISPR-Cas, but the method will only work for one gene in one organism.  We have the theory, and a limited-use implementation.  Again, publishable, but not very high impact.  Finally, imagine the same bright spark, they have the theory, and they have an implementation which can be used for most genes in most organisms.  That’s the wow paper right there.

We can apply this to theory papers in computational biology.  There are plenty of journals in which to publish theories.  If you have a theoretical method, publish it there.  The next stage is a theory with a software implementation that is not intended for re-use.  It’s fine, but it’s low impact and should be treated as such.  There are plenty of software tools that work in a single, given situation, but cannot and should not be applied elsewhere.  There are plenty of models that are over-trained and over-fitted.  Single-use software is not necessarily good evidence for your theory.  Finally, the gold standard – theory, and software that is generally applicable to many instances of a problem.  Clearly the latter is the highest impact, but more importantly, the two former cases are (in my opinion) of very limited use to the community.  I’ll give you an example: I have a theory (and many others do too!) that uncertainty in sequence reads should be represented as a graph, and that we should align read graphs to reference graphs to extract maximum information content in genomics.  Shall I just sit back and wait for the phone call from Nature?  I think not.  Ah but here, I have a Perl script that takes a single read-as-a-graph and aligns it perfectly to the human genome.  Again, Nature?  Maybe Science this time?  Hmmm?

Software/method publication by stealth

The problem with accepting a paper where the authors present a novel method but a software implementation that cannot be re-used is that this is software publication by stealth.  Because the software will be re-used – by the authors.  This is a way in which computational biologists can publish a method without enabling others to easily carry out that method; it’s back to the “trade secrets” publication style of the 1700s.  It’s saying “I want the paper, but I am going to keep the software to myself, so that we have an advantage over everyone else”.  In short, it is the absolute antithesis of what I think a publication is for.

Software publications

It’s clear to me that publications focused on the release of software should of course make the software re-useable.  To jot down a few bullet points:

  1. The source code should be readable
  2. It should compile/run on readily available platforms
  3. The authors should list all dependencies
  4. The authors should state all assumptions the software makes
  5. The authors should state when the software should be used and when it shouldn’t be used
  6. There should be a tutorial with test datasets
  7. The authors should demonstrate use on both simulated and real data
  8. There should be methods by which users can get help
  9. If command line, the software should print out help when run with no parameters (a minimal sketch of this is below)
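
On that last point, here is a minimal, hypothetical sketch in R of what “print help when run with no parameters” could look like for a command-line script; the script name, arguments and usage text are all made up for illustration:

#!/usr/bin/env Rscript
# mytool.R - hypothetical example: print usage and exit if no arguments are given

args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 0) {
  cat("Usage: mytool.R <input.fasta> <output.txt>\n",
      "  input.fasta   sequences to process\n",
      "  output.txt    file to write results to\n", sep="")
  quit(save="no", status=1)
}

input  <- args[1]
output <- args[2]
# ... the real work of the tool would go here ...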

What do you think?

Recreating a famous visualisation

Many of you will be aware of the Project Tycho visualisation(s) about vaccination that were featured in the Wall Street Journal – if you’re not, you should be.  They’re beautiful, and almost everything a good visualisation should be:

[Figure: the WSJ / Project Tycho measles visualisation]

The plots shown on the WSJ website are actually interactive and use Javascript, which is a bit beyond the scope of this blog post, but I wanted to show how you can create something similar.

The Data

I can’t give you the data due to T&Cs, but you can download it yourself.  Go to Project Tycho, register, log in, choose level 2 data, click “search and retrieve data”, choose state rather than city, add all of the states to the selection box, untick graph and submit the query.  There is then an option to download in Excel.  The Excel file is actually CSV and we can import it into R quite easily (a sketch of the import follows the data below).  This is what (a subset of) my data look like in R:

     YEAR WEEK ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT
1096 1930    1       7      0              0       4      196        178       18          64
1097 1930    2      24      0              0       0        2        442       69          62
1098 1930    3      28      0              0       2        9        490       26          44
1099 1930    4      21      0              0       1        3        628       40          23
1100 1930    5      47      0              0       5        7        864      101          25
1101 1930    6      63      0              0       6        6        943      101          24
1102 1930    7     108      0              0       5        5        954       65          20
1103 1930    8     167      0              0       2        9       1151      170          13
1104 1930    9     191      0              0       7       15       1433      150          23
1105 1930   10     242      0              0       5        8       1514      256          39
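
For reference, here is a minimal sketch of the import step.  The filename and the number of header lines to skip are assumptions (check your own download), but reading the counts as character rather than factors matters, because it makes the later as.numeric() conversion behave as expected:

# filename and skip value are placeholders for whatever Project Tycho gives you
d <- read.csv("tycho_level2_measles_states.csv", header=TRUE, skip=2,
              stringsAsFactors=FALSE)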

We need to do a few things – substitute the “-” entries with zeros, convert everything to numeric, limit the data to 1930 onwards, and sum over weeks:

d[d=="-"] <- 0
for (i in 2:61) {
   d[,i] <- as.numeric(d[,i])
}
d <- d[d$YEAR>=1930,]
yd <- aggregate(d[,3:61], by=list(year=d[,1]), sum)
yd <- yd[order(yd$year),]

My data now look like this:

   year ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE
1  1930    4389      0              0    2107      996      43585    11821        1978      264
2  1931    8934      0              0    2135      849      27807     4787       12869     2428
3  1932     270      0              0      86       99      12618     2376        5701       39
4  1933    1735      0              0    1261     5438      26551      297        5430      205
5  1934   15849      0              0    1022     7222      25650    12946        4659     3063
6  1935    7214      0              0     586     1518      28799    12786       24149     1040
7  1936     572      0              0    2378      107      49092      604        3987     1490
8  1937     620      0              0    3793      322       5107     1027       10181     1796
9  1938   13511      0              0     604     6690      19452     8403        1388      492
10 1939    4381      0              0     479     1724      67180     5182       15817      159

Creating the visualisation

It’s important to note that great visualisations don’t just happen; they take work; they evolve over time; there is trial-and-error until you get what you want.  Don’t be scared of this, embrace it!

I’m going to use the heatmap.2() function from gplots in R.  Let’s look at a basic, raw heatmap (with some things turned off that I always turn off in heatmap.2()):

library(gplots)
heatmap.2(as.matrix(yd[,2:60]), trace="none", key=FALSE)

OK… I don’t actually want to cluster years, nor do I want to cluster states, so I should turn off the dendrograms.  I also have the data the wrong way round, so I need to transpose the matrix:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
          dendrogram="none", trace="none", key=FALSE)

OK, we start to see a pattern: there are more yellow squares to the left of the diagram; yellow in this context means high, and left means early in the period we’re looking at – so we begin to see structure in the data.  Still a terrible heatmap, though.  There is lots of wasted space; there are no year labels; and the state names are being cut off.  Let’s deal with those:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12))

I get an error about the figure margins being too large, but this is quite common and can often be ignored.

We’re getting somewhere now.  It’s time to think about colours.

Colours

Let’s get something awkward out of the way.  The guys who created this visualisation cheated a little bit.  Let’s look at the colour scale:

[Figure: the WSJ colour scale]

This is not a natural colour scale.  White-through-to-blue is fine, but blue does not naturally flow into yellow and then red.  So this is a false colour scheme which has been designed for maximum impact – low numbers of measles cases get nice colours such as white, blue and green; high numbers get danger colours such as yellow and red.  Is this acceptable?  I’ll let you decide.

In addition, note the WSJ figures have been scaled per 100,000 population.  I do not have the historical population figures for American states going back to 1930, so we are plotting the raw data (raw number of cases) not scaled for population.  So we can’t completely recreate the WSJ figure, but we can get close.

Colours and breaks

In heatmaps, we have the concept of colours and breaks.  Breaks are provided as a vector of numbers, and essentially, we use each colour to represent values that fall between each subsequent pair of breaks.  In other words, the colours fit “between” pairs of breaks.  Perhaps best explained by example.  Let’s say our breaks vector is 0, 100, 200, and our colours are “green” and “red”.  Therefore, any value between 0 and 100 will be green in the heatmap, and any value between 100 and 200 will be red.   As colours fit in the middle between breaks, there should be one less colour than there are breaks.  I often find it useful to set my own breaks and colours in heatmaps – the function can decide for you, but how do you know what assumptions it is making?
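
As a toy illustration of this (made-up values, nothing to do with the measles data), the green/red example above looks like this:

library(gplots)

# two colours, three breaks: values in (0,100] are green, values in (100,200] are red
m <- matrix(c(50, 150, 75, 199), nrow=2)
heatmap.2(m, Rowv=NULL, Colv=NULL, dendrogram="none", trace="none", key=FALSE,
          col=c("green", "red"), breaks=c(0, 100, 200))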

Measles

If I want to recreate the colours of the WSJ plot, I need low values between white and blue, and medium-to-high values between yellow and red.  My minimum and maximum values are 0 and 126221:

min(yd[,2:60])
max(yd[,2:60])

I’m going to use colorRampPalette() to create ten colours between white and blue, then 30 colours between yellow and red.  I’m also going to game the breaks, just so I can get something similar to the WSJ look:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(30))

I now need to set my breaks.  I want white to be zero, so my first two breaks should be 0 and 0 (i.e. only values between 0 and 0 will be white; in other words, only 0 will be white).  I’m then going to scale up to 700 in steps of 100; and the rest I am going to split equally between 800 and the maximum value (which I’ll set at 127000).

bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,length=32))

NOTE: this is a bit dodgy.  Really, my bins should be the same size, but it’s the only way I could recreate the WSJ look with this data.

Now the heatmap:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks)

OK!  Now we’re getting somewhere!  The WSJ heatmap has gaps between each row and between each column, so let’s add those in:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white")

We want a line at 1961, which, after a bit of trial and error, turns out to be at x=32:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2))

Now let’s tidy up those year labels:

row.labels <- rep("", 72)
row.labels[c(1,11,21,31,41,51,61,71)] <- c("1930","1940","1950","1960","1970",
                                           "1980","1990","2000")
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2))

We really should add a title, and we’ll need to go back and make room for it using the lhei parameter:

par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2),
        main='Measles cases in US states 1930-2001\nVaccine introduced 1961')

And there we have it!  I think this is about as close to the WSJ visualisation as I can get!

The only thing I would say is that I am disappointed that I had to game the break points to get here.  It doesn’t sit well with me.  However, it was the only way I could figure out how to get so much blue into the left hand side of the graph, whilst still keeping the right side white.

The simple fact is that, keeping the break points a similar size (100), the graph looks better:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(1261))
bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,by=100))
par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2),
        main='Measles cases in US states 1930-2001\nVaccine introduced 1961\n(data from Project Tycho)')


So, there we are :-)  A final (larger version) can be found here.

You probably don’t understand heatmaps

If you work in any area of quantitative biology, and especially if you work with transcriptomic data, then you are probably familiar with heatmaps.  They have been used for as long as I have been in research: these figures cluster the rows and columns of a data matrix, and show both dendrograms alongside a colour-scaled representation of the data matrix itself.

See an example from one of my publications below:

Pretty simple huh?  Well actually, no, they’re not, and unless you’re a statistician or bioinformatician, you probably don’t understand how they work ;-)

There are two complexities to heatmaps – first, how the clustering itself works (i.e. how the trees are calculated and drawn); and second, how the data matrix is converted into a colour-scale image.

Clustering data

I don’t really have time to explain cluster analysis properly – the term covers a huge range of methods.  I can explain the simplest, though: hierarchical, agglomerative cluster analysis.  In a nutshell, this works by first calculating the pairwise distance between all data points; it then joins the two points that are least distant from one another; then the next least distant pair; and so on until everything has been joined.  The tree is a graphical representation of this process.  At some point the process needs to join groups of points together rather than single points, and again there are many ways to do this; one of the most common methods is to use the average pairwise distance between all pairs of points in the two groups (average linkage).
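
As a minimal sketch of the mechanics (toy data only): dist() gives the pairwise distances, hclust() does the agglomeration, and method="average" is the average-linkage rule described above:

set.seed(42)
x <- matrix(rnorm(40), nrow=8)       # 8 toy data points, 5 measurements each

d  <- dist(x)                        # pairwise (euclidean) distances
hc <- hclust(d, method="average")    # join nearest points/groups, average linkage
plot(hc)                             # the tree records the order of joining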

Know your distance measure

Put simply, the distance measure quantifies how different two data points are.  It is the counterpart of a similarity measure, which quantifies how similar two data points are.

So how do we calculate distance?  With the default methods for both the heatmap() and heatmap.2() functions in R, the distance measure is calculated using the dist() function, whose own default is euclidean distance.  This measures the absolute distance between the points in space, and quite importantly, pays no attention to the “shape” of the “curve”.  To explain this, let’s use an example.  Imagine we have measured the gene expression of 4 genes over 8 time points:

h1 <- c(10,20,10,20,10,20,10,20)
h2 <- c(20,10,20,10,20,10,20,10)

l1 <- c(1,3,1,3,1,3,1,3)
l2 <- c(3,1,3,1,3,1,3,1)

mat <- rbind(h1,h2,l1,l2)

par(mar=c(4,4,1,1))
plot(1:8, rep(0,8), ylim=c(0,35), pch="", xlab="Time", ylab="Gene Expression")

for (i in 1:nrow(mat)) {
   lines(1:8, mat[i,], lwd=3, col=i)
}

legend(1, 35, rownames(mat), 1:4, cex=0.7)


So we have two highly expressed genes and two lowly expressed genes.  Crucially for this example, the two pairs of genes (high and low) have very different shapes.

If we input this data into the dist() function, we get a distance matrix:

dist(mat)

          h1        h2        l1
h2 28.284271                    
l1 38.470768 40.496913          
l2 40.496913 38.470768  5.656854

Here we can see that the least distant are l1 and l2 (distance = 5.65), so they will be clustered together; next least distant are h1 and h2 (distance = 28.28), so these will be clustered together next.  Then finally the two groups will be joined.  This is borne out by a naive cluster analysis on the distance matrix:

hc <- hclust(dist(mat))
plot(hc)

And a naive heatmap (I’ve turned off the column tree as, in gene expression profiling over time, we generally want the time points to be in the correct, original order):

library(gplots)   # greenred() comes from gplots
heatmap(mat, Colv=NA, col=greenred(10))

There are multiple things going on here, so let’s take this one thing at a time.  The clustering has worked exactly as it was supposed to – by distance, l1 and l2 are the most similar so they are grouped together; then h1 and h2, so they are grouped together.  But the heatmap looks terrible, the colours are all wrong.  Why?  Well, despite l1 and l2 being clustered together, their colours do not follow the same pattern; same goes for h1 and h2.  Also, l1 and h1 have the same colours despite having VASTLY different expression profiles; same for l2 and h2.

Understand scaling

Think about the data, and then think about the colours in the heatmap above.  Data points l2 and h2 have exactly the same colours, as do l1 and h1 – yet they have very different values.  That’s because the data points are scaled prior to being converted to colour, and the default in both heatmap() and heatmap.2() is to scale by row (other options are to scale by column, or not at all).  Scaling by row means that each row of the data matrix is taken in turn and given to the scale() function; the scaled data are then converted into colours.
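
A quick way to see what this does is to row-scale the toy matrix by hand; this is a sketch of the same idea (heatmap() does its scaling internally, but t(scale(t(mat))) reproduces it):

# scale() works column-wise, so transpose, scale, then transpose back to scale by row
smat <- t(scale(t(mat)))
round(smat, 2)

# after row-scaling, h1 and l1 have identical values (as do h2 and l2),
# which is why they end up with identical colours in the default heatmap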

Let’s turn off scaling and see what happens.  Here is the heatmap clustered by euclidean distance with scaling turned off:

heatmap(mat, Colv=NA, col=greenred(10), scale="none")

Well, this looks slightly better, but still not great!  l1 and l2 are at least both green, and h1 and h2 are at least both red/black (though they still oppose one another).

What’s happening here?  Well, think about a heatmap and what green, red and black mean to you.  Green usually refers to low; red usually refers to high; and black is somewhere in the middle.  Well, without scaling, both l1 and l2 are “low”, so they get given green colours;  the highest points of h1 and h2 are “high”, so they get red; the low points of h1 and h2 are in the middle, so they get black.

Improving things

Is any of this what you would expect?  The answer, I think, is probably no.  Usually, in gene expression profiling, we want to cluster together genes that have a similar profile, or similar shape, over time.  When we apply a colour scale, as we do in a heatmap, we give low values green, high values red, and middle values black.  Of course some genes are highly expressed, and some lowly expressed; do we want to give all highly expressed genes the colour “red”, and all lowly expressed genes the colour “green”, regardless of shape?  Well, sometimes we might, but most often we don’t.  This is why the heatmap and heatmap.2 defaults are quite strange to us – they both scale the data by default, which is great if you want to colour data points according to their shape; but they use euclidean distance, which is not what you want to use to cluster points by shape.

How do we fix this?  From the gene expression profiles, we know that h1 and l1 have a similar shape, and h2 and l2 have a similar shape, but dist() doesn’t care about shape, it only cares about absolute distance.

How do we cluster on “shape”?  Well, we need to use a different distance measure.  To do this, we actually start with a similarity measure: the Pearson correlation coefficient.  Without going into too much detail, the Pearson correlation coefficient produces a value between -1 and 1; a value of 1 means the shapes of two profiles are identical; a value of -1 means the shapes of two profiles are exactly opposite; and values in between represent profiles that are more or less similar.

The correlation matrix of the above data looks like this:

cor(t(mat))

   h1 h2 l1 l2
h1  1 -1  1 -1
h2 -1  1 -1  1
l1  1 -1  1 -1
l2 -1  1 -1  1

We can see that h1 and h2 have a correlation coefficient of -1 and are therefore very dissimilar; the same goes for l1 and l2.  However, h1 and l1, and h2 and l2, are perfectly positively correlated!  This is what we expect!  However, for clustering (and heatmaps) we need a distance measure, not a similarity measure, so we subtract all of these values from 1, which gives:

1 - cor(t(mat))

   h1 h2 l1 l2
h1  0  2  0  2
h2  2  0  2  0
l1  0  2  0  2
l2  2  0  2  0

Here the distance between h1 and l1 is 0; the distance between h2 and l2 is zero; the distance between h1 and h2 is 2 and the distance between l1 and l2 is also 2.

We can draw a naive cluster analysis of this data:

hc <- hclust(as.dist(1-cor(t(mat))))
plot(hc)


And a simple heatmap:

heatmap(mat, Rowv=as.dendrogram(hc), Colv=NA, col=greenred(10))

This is what we want – genes clustered on the shape of their expression profile; l1 and h1 clustered together, and they have the same colours; and l2 and h2 clustered together, and they have the same colours. I mean, it still looks awful, but that’s because we used a silly example!

A proper heatmap

This example has been on the ARK-Genomics website for some time:

library(gplots)

# read the data in from URL
bots <- read.table(url("http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt"), sep="\t", header=TRUE)

# get just the alpha data
abot <- bots[,c(8:25)]
rownames(abot) <- bots[,1]
abot[1:7,]

# get rid of NAs
abot[is.na(abot)] <- 0

# we need to find a way of reducing the data.
# Sort on max difference and take first 1000
min <- apply(abot, 1, min)
max <- apply(abot, 1, max)
sabot <- abot[order(max - min, decreasing=TRUE),][1:1000,]

# cluster on correlation
hc <- hclust(as.dist(1 - cor(t(sabot))), method="average")

# draw a heatmap
heatmap(as.matrix(sabot),
        Rowv=as.dendrogram(hc),
        Colv=NA,
        col=greenred(10),
        labRow="")


I hope by now that you understand that heatmaps are quite complex visualisations; there are actually quite a few more complexities, but more of that in another post!

 

 

The cost of sequencing is still going down

There’s been a bit of chatter on Twitter about how the cost of sequencing is not going down anymore, with reference to that genome.gov graph.  I realise that sensationalist statements get more retweets, but I am sorry, this one is complete crap – the cost of sequencing is still on a long-term downward trend.

By way of qualification, I have been helping to run our University’s next-generation genomics facility, Edinburgh Genomics, since 2010.  We are an academic facility running an FEC model – which means we recover all of our costs (reagents, staff, lab space etc) but do not make a profit.  If you are interested in Illumina sequencing, especially after reading below, you should contact us.

You can read some relatively up-to-date references here, here and here.

What I am going to talk about below are the medium- to long- term trends in sequencing prices on the Illumina platform.  There are fluctuations in the price in any given period (for example Illumina put prices up across the board a few weeks ago), but these fluctuations are very small in the context of the wider, global trend of cost reduction.

HiSeq V3

Back in 2013/14, the most cost-effective method of sequencing genomes was on the HiSeq 2000/2500 using V3 chemistry.  At its best, a lane of sequencing would produce 180 million paired 100bp reads, or 36 gigabases of sequence data.  I am going to use this as our baseline.

HiSeq V4

After HiSeq V3 came HiSeq V4, which was introduced last year.  All new 2500s could offer this and many new-ish 2500s could be upgraded.  At its best, V4 produces 250 million paired 125bp reads, or 62.5 gigabases of sequence data.

Of course, Illumina charge more for V4 reagents than they do for V3 reagents, but crucially, the price increase is proportionally smaller than the increase in throughput.  So, at Edinburgh Genomics, the cost of a V4 lane was approx. 1.2 times the cost of a V3 lane, but you get 1.7 times the data.  Therefore, the cost of sequencing decreased.

HiSeq X

This is rather trivial I think!  By my calculations, a lane of the HiSeq X will produce around 110 gigabases of sequence data, which is 3 times as much data as HiSeq V3, and the cost has come down to about 0.4 times.  Therefore, the cost of sequencing decreased.

HiSeq 4000

The HiSeq 4000 is a bit of an unknown quantity at present as we haven’t seen one in the wild (yet) and nor have we come up with a detailed costing model.  However, my back-of-a-post-it calculations tell me the HiSeq 4000 lanes will produce around 93 gigabases of data (about 2.6 times as much data as HiSeq V3) at about 1.4 times the cost.   Therefore, the cost of sequencing decreased.
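
To put some rough numbers on all of the above, here is a back-of-an-envelope sketch using the approximate throughput and cost multipliers quoted in this post, with a HiSeq 2500 V3 lane as the baseline (these are approximations, not a costing model):

platforms <- data.frame(
  platform = c("HiSeq V3", "HiSeq V4", "HiSeq X", "HiSeq 4000"),
  gb       = c(36, 62.5, 110, 93),   # approximate gigabases per lane
  relcost  = c(1.0, 1.2, 0.4, 1.4)   # lane cost relative to a V3 lane
)

# cost per gigabase, relative to V3 (lower means cheaper sequencing)
platforms$rel_cost_per_gb <- round((platforms$relcost / platforms$gb) * 36, 2)
platforms

On these numbers, every newer chemistry comes out below the V3 baseline, which is the whole point.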

Drawing a new graph

Here it is.  You’re welcome.

[Graph: the updated sequencing cost trend]

My numbers are accurate, despite what you may hear

It came to my attention last year that some facilities might be offering HiSeq V3 lanes with over 200 million read-pairs/clusters.  It is possible to go that high, but often only through a process known as over-clustering.  This is a bad thing.  It not only reduces sequence quality, but also produces lots of PCR and optical duplicates which can negatively affect downstream analysis.

Other platforms

I haven’t spoken about Ion Torrent or Ion Proton for obvious reasons.  I also haven’t included NextSeq 500 nor MiSeq – to be honest, though these are very popular, they are not cost-effective ways of producing sequence data (even Illumina will admit that) and if you told your director that they were, well, shame on you ;-)

PacBio?  Well, it seems the throughput has increased 3 times in the last year, despite the need for an expensive concrete block.  So I can’t really see the cost of PacBio sequencing going up either.

MinION and PromethION – costs currently unknown, but very promising platforms and likely to bring the cost of sequencing down further.

Complete Genomics – well, as I said in 2013, they claimed to be able to increase throughput by 30 times.

There is also the BGISEQ-1000, which apparently can do 1 million human genomes per year. Apparently.

All of which means – the cost of sequencing is going to keep coming down.

So why is the genome.gov graph incorrect?

I don’t know for sure, but I have an idea.  Firstly, the data only go to July 2014; and secondly, the cost per genome is listed as $4905, which is obviously crap in the era of HiSeq X.

Can we stop this now?

Defunding research excellence in Scotland

Earlier this year, the results of the Research Excellence Framework were published, a 5-yearly attempt to grade and rank (from best to worst) universities throughout the UK in terms of their research excellence.  It is a far from perfect process, estimated to cost over £1 billion, and is highly criticised throughout the academic world.  However, it’s important, because the rankings are used to inform funding decisions – the implicit promise being that the higher up the rankings you are, the more funding you get.

You can download the full REF 2014 results here.

You will see that the results are split by University, then by department, and then finally into Outputs, Impact and Environment.  The definition of these three can be seen here, but for the purposes of this blog post, we will focus only on Outputs – that is the traditional outputs of academic research (e.g. in scientific areas, these would be peer-reviewed papers).

I’ve cleaned up the Scottish data, summarised over all departments, and you can see it here.

Now, Times Higher Education describe two metrics which can be calculated on the data:

  • GPA (grade point average). Loosely speaking this can be thought of as the “rate” of high quality research.  The higher the value of the GPA, the more likely research at that institution will be of high quality.
  • Research Power.  This can be thought of as the volume of high quality research, and is simply the GPA multiplied by the FTE count.

The reason we have both of these is to separate institutions that submit only a very small amount of high quality research to the REF from those that submit a larger amount of mixed quality research.
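
For what it’s worth, both metrics are easy to compute from the published profiles; a minimal sketch is below.  The column names and numbers are made up – the REF spreadsheet gives, for each institution, the percentage of outputs at each star level plus the FTE count submitted:

ref <- data.frame(
  institution = c("Uni A", "Uni B"),
  pc4 = c(40, 20), pc3 = c(45, 40), pc2 = c(10, 30), pc1 = c(5, 10),  # % at 4*, 3*, 2*, 1*
  fte = c(500, 120)
)

# GPA: weighted average of the star ratings (unclassified outputs score 0)
ref$gpa <- (4*ref$pc4 + 3*ref$pc3 + 2*ref$pc2 + 1*ref$pc1) / 100

# Research Power: GPA multiplied by the FTE count submitted
ref$power <- ref$gpa * ref$fte
ref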

Remember here we are only looking at research outputs.  So how do the Scottish Universities compare to one another?  Here are the Scottish institutions arranged by GPA:

[Table: Scottish universities ranked by GPA]

And here are the Scottish institutions arranged by Research Power:

[Table: Scottish universities ranked by Research Power]

I don’t think there are any real surprises in these tables, if I’m honest.

Now, it is the Scottish Funding Council (SFC) that distributes Government funds to universities in Scotland, and we can see the “Research Excellence” settlement for the next 4 years here.  You can download a sanitised version here.

If we take the 14/15 research excellence settlement as the baseline, then we can plot it against both GPA and Research Power:

[Scatter plots: 2014/15 research excellence funding against GPA and against Research Power]

Whilst, in general, universities with a higher GPA receive more funding, GPA is normalised by research volume, and we can see clearly that it is research power that research funding correlates with.

Those of you who have looked at the SFC funding figures will see that they change over the next few years.  So let’s take a look at those changes, using this year as a baseline. Here are the raw figures:

 

[Chart: changes in SFC research excellence funding over the next three years]

Does it look like there’s an outlier to you?  That data point in the bottom right is the University of Edinburgh – ranked first for research power in Scotland and second for GPA; it is also ranked first for loss of funding.  However, Edinburgh receives the most funding from the SFC, so let’s look at the percentage drop in funding:

[Chart: percentage change in funding, by university]

This changes the picture a bit: the biggest loss in funding in percentage terms is Robert Gordon University, losing 59% of its 2014 budget over the next 3 years.  Second worst off is the University of Edinburgh, losing 20%; then Dundee (13%); Aberdeen (10%); and St Andrews (4%).

In fact, 4 of the top 6 Scottish universities (ranked by GPA or by Research Power) are receiving a funding cut.  In contrast, 7 of the bottom 8 Scottish Universities (ranked by GPA or by Research Power) are receiving a funding increase.

It is hard to see this as anything other than the defunding of research excellence in Scotland.  The argument is that by taking money from “the rich” and giving it to “the poor”, the SFC will raise the quality of research at some of the lower ranked universities.  Well it might, but there is no guarantee.  At the same time, the opposite will occur – the quality of research at the higher ranked universities will drop.  By my calculations, the University of Edinburgh will lose £17m over the next 3 years.  That can only have a negative effect on research excellence.

I am not against Robin Hood politics – there absolutely should be a continuous re-distribution of wealth from the rich to the poor.  But does this work in a research setting?  The REF ranks universities on “research excellence”.  Universities strive to be as close to the top of those rankings as they possibly can be, and they spend a fortune to do it.  What is the point of the ranking if the result is that the best are worse off than they were before?  Put another way, could Edinburgh, Dundee, Aberdeen and St Andrews have done anything differently?  They all improved since 2008; they all came in the top 6 in Scotland.  Yet the reward is a cut in funding.  Perversely, had their research quality plummeted and had they been ranked in the bottom 8, would they have received a funding increase?

More likely is that the SFC intended to redistribute funding no matter what the outcome of the REF (note the SFC funding above is called the “research excellence grant”).  If the longer term policy of the SFC is that funding will gradually be moved from the higher ranked universities to the lower ranked, then they immediately remove the incentive for the higher ranked universities to improve or even maintain their ranking.  It could be argued that there is also little incentive for the lower ranked universities to improve, as they will get increased funding anyway.

#sigh

Overall, I have to say, I am left completely bemused by the whole REF process.  What the **** was the point?

How to not sack your professors

I have to admit to being a bit shaken by two recent events in UK academia – the tragic death of Stefan Grimm, a Professor at Imperial College, who had recently been told he was “struggling” to “fulfil the metrics” (grant income of £200k per annum); and the sacking of Dr Alison Hayman, a lecturer at Bristol University, for failing to generate enough income from grants.

Let me be clear – this post is not an attack on Imperial nor Bristol, and I am in no way suggesting either has done anything wrong.  I don’t know enough about either case to pass judgement.

Nor do I have a problem with institutions taking steps to remove people from their post who are not performing satisfactorily.  We are all paid to do a job, and we need to do it.  (Note I am not suggesting that there was a problem with either Stefan’s or Alison’s performance; I don’t know enough to pass judgement.)

However, I do have an issue with judging an academic’s performance by only a single metric – how much money they have “won” in grants.  Firstly, because the grants system can be hugely unfair; secondly, and more importantly, because winning grants is not something we do alone – it is a collaboration with other scientists, and also a collaboration with the institution where we work.

In this collaboration, the role of the PI is to come up with fundable ideas; and the role of the institution is to provide an environment conducive to getting those ideas funded.  I don’t think it is fair to sack someone for not winning grants if you have failed to provide an environment within which it is easy to do so.

I am very lucky, because the institution where I work, The Roslin Institute, is exceptional at this.  So I am going to pass on a few tips, simple things that academic institutions can implement, which will mean that no-one has to sack any more academics.

1. Provide funding

This will probably have the biggest impact, but comes at the highest cost.  In my experience, nothing makes a grant more fundable than some promising preliminary data.  Generating data costs money, but it takes data to generate money.  So fund up front.  Give every PI a budget to spend every year with the explicit intention that it should be spent to generate preliminary data for grant applications.  This single, simple step will likely have a greater impact on your grant success than any other.  And make sure it is a decent sum of money.  I recall speaking to a PI from a UK University who told me that each PI gets £150 per year, and out of that they need to pay for printing.  Printing.  3 pence a sheet.  That was over 10 years ago and I’m still shocked today.

2. Cut the admin

Every minute your PIs spend on compulsory training courses, filling in forms, filling in reports, dredging your awful intranet for information that should be easy to find, filling in spreadsheets, monitoring budgets, calculating costs, dealing with pointless emails and attending meetings is a minute they are not spending writing grants and papers.  Cut. The. Admin.  In fact, employ administrators to do the admin.  It’s what they’re good at.

3. Perform independent QC

So one of your PIs has grant proposals that are repeatedly rejected.  Does that make them bad proposals?  Or bad ideas?  Perhaps they are excellent proposals, but they don’t hit the right priorities (which means they didn’t go to the right funder, and that might be your fault).  Read the grants yourself and form your own opinion.  Collect the review reports.  Collect the feedback from the funders.  Were they bad proposals?  Or were they good proposals that didn’t get funded?  I really don’t think it’s tenable to sack people if they repeatedly submit grant proposals that are rated as fundable by the committee concerned.  At that point the PI has done their job.

You might also think about putting in place an internal group of senior academics to survey proposals before they are submitted.  This will give you the opportunity to provide feedback on proposals and perhaps make them more fundable before they even reach the grant committee.  Proposals which are really not ready can be kept over until the next submission date, giving time for more improvements.

4. Provide support

Do I even need to say this?  For the PIs who have their grants rejected, give them some support.  Give them a mentor, someone who gets a lot of proposals funded.  Provide training and workshops.  Share tips for success.  Sit with them and discuss their ideas, try and influence their future direction.  Do everything you possibly can to help them.

5. Pay it forwards

Every institution has their superstars, the guys who only need to wink at a committee and they’ll get funded.  But those guys, those professors with 10 post-docs and 15 students, they’re black holes that can suck in all the funding around them, making it difficult for others to get a fair share of the pot.  As an institution, you don’t want them to stop, because you want the funding, of course; but there is a compromise, where these superstars share their knowledge and expertise, where they co-author proposals with less successful (perhaps more junior) PIs, lending their name and their weight, their reputation and gravitas, to a proposal.  When the proposal is funded, it is the junior PI who runs the grant and gets the last author publications.  It doesn’t matter to the more senior PI as they probably already have tenure and an H index north of 50.  So pass it on.  Pay it forwards. Transfer that wonderful grantsmanship to the next generation, and coach your next round of superstars.

6. Be excellent

Yes, you.  The institution.  Be excellent at something.  I don’t care whether it’s medicine or ecology, worms or plants or cats or humans, evolution or botany, I couldn’t care less, but choose something and be excellent at it.  Invest in it.  Create a global reputation for that thing so that when reviewers see a proposal they immediately think “these guys know what they’re doing”.  Make sure you have the equipment and the facilities to do it (see 8).

7. Make it easy to work with industry

As PIs we are increasingly being pushed to generate “impact”, and one way of doing this is to collaborate with industry.  But this is a skill and some institutions are very good at it, others very bad.  Be one of the good ones.  Create strategic partnerships with industry, pump-prime some work (again to generate preliminary data), run workshops and industry days and have a legal team that are set up to make it work, rather than to make it difficult.  There are lots of funding streams available only to academic-industrial partnerships, and you’d be insane to ignore them.

8. Invest in infrastructure

Make sure you have the right equipment, the right type of lab, and the right computing to ensure PIs can actually do science.  It seems obvious, but you’d be surprised how many institutions are out there that simply don’t provide adequate facilities and infrastructure.

——————————————

So, there it is.  I’ve run out of suggestions.  As I said above, I am very lucky, I work at The Roslin Institute, which does as much as it possibly can to create an environment where PIs win grants.

Here’s the thing – if you think your staff are failing, for whatever reason, the very first thing you should do is ask yourself this question “Is it our fault?  Could we have done anything more to help them?”.  I’d argue that, in most cases, the answer to both would be “Yes”.  We all share in the success of published papers and grants won; so don’t forget that there is a shared responsibility in failure, and if some of your PIs are not winning grants, at least some of that is the institution’s fault.