How journals could “add value”

I wrote a piece for Genome Biology (you may have read it) about open science.  I said a lot of things in there, but one thing I want to focus on is how journals could “add value”.  As brief background: if you’re going to make money from academic publishing (and I have no problem if that’s what you want to do), then I think you should “add value”.  Open science and open access are coming: open access journals are increasingly popular (and cheap!), preprint servers are more popular, green and gold open access policies are being implemented, and so on. Essentially, people are going to stop paying to access research articles pretty soon – think a 5-10 year time frame.

So what can journals do to “add value”?  What can they do that will make us want to pay to access them?  Here are a few ideas, most of which focus on going beyond the PDF:

  1. Live figures: This is something F1000Research are already doing, and there’s no reason other journals couldn’t do the same.  Wherever there is a figure, let readers interact with it – change the colours, axes, chart type etc etc
  2. Re-run analyses in a browser: I think this is something Gigascience have attempted, and it would be an incredible training opportunity.  Let readers repeat the entire data analysis, in a browser, using e.g. Galaxy or IPython Notebook
  3. Re-run analyses from the command-line: as above, but provide a VM and a list of commands that readers can run to repeat the analysis
  4. Re-run analyses as an R package: imagine it – every paper has a downloadable R package, with the data included, that allows readers to explore the data within the wonderful statistical and visualisation environment that is R.
  5. Add/remove data – “what happens to graphs, tables and statistics if I remove these data points?”
  6. Show discarded data – so the data had outliers that were cleaned/removed?  Show them to me.  Where are they in the graphs?  Where are they in the tables?  Why were they discarded?
  7. What would my data look like? – basically, if the reader has a dataset which is similar to that analysed in the paper, the reader can upload that dataset (in agreed format) and see how the graphs, tables and stats change
  8. Enforce conventions – when you’re parsing that word document and converting to PDF, pick up gene names and automatically suggest community standards (e.g. HGNC)
  9. Enforce data submission (or do it for the authors).  Yes, do not publish unless the data are submitted to a public archive.  In fact, help the authors do it!
  10. Find similar data in…. – help readers find similar (public) datasets in public databases
  11. Actually check the statistics.  Yes you, the journal.  Most biologists aren’t qualified to check the statistics and experimental design, or do power analysis, so you do it.  Employ some statisticians!

OK, so I’m not pretending any of the above is easy, but I am unsure why none of it is happening – some publishers make HUGE PROFITS, so why on earth have they not updated their product?  Imagine if Apple were still trying to sell version 1 of the iPod – no-one would buy it.  Most products get updated on a frequent basis to keep customers happy, yet journals have stuck with the PDF/print version for decades.  Time for an update, especially if you want to keep the money flowing in.

On bioinformatics, and torturing the mechanic metaphor

I acted as editor and reviewer for a recent opinion piece in Frontiers titled “Who qualifies to be a bioinformatician?”.   Whilst many find this navel-gazing boring, I do understand the need for bioinformaticians to define themselves, with so much expected of us: setting up BLAST servers, administering websites, carrying out detailed statistical analyses, supporting core facilities, supporting researchers, building databases, training others and fixing the bloody printer.

In the above paper, the following section seems to have hit home, with a few people picking up on the mechanic metaphor:

Bioinformaticians are scientists who develop and conduct research based on a bioinformatics approach, they do not just use the tools to better understand a biological problem. It is a little like saying that driving your car to work does not make you a mechanic.

The message is clear.  If “all” you do is use bioinformatics software, then you are not a bioinformatician; equally, if “all” you do is drive a car, it doesn’t make you a mechanic.  I can see what the authors are going for here, and I agree, but only to an extent.

You see, a mechanic can take an existing car, and they can fix it; replace parts. They can tweak it and improve performance.  But they can’t build you a car from scratch.  So we can extend/torture the mechanic metaphor and say “just because you can fix a car, it doesn’t mean you can build one”.  So perhaps “mechanics” are like pipeline builders in bioinformatics.  They can put things together, but ask them to create something new, and they are (often) lost.

We can go further.  The people who build cars are not those who design them.  That’s a completely different set of skills.  Of course in software, design and build are often (but not always) carried out by the same people, but as we all have seen, that can have disastrous consequences, resulting in sophisticated software that no-one can use.  We may wish to think harder about separating design from build in bioinformatics; there are benefits.

Beyond the designers are the engineers, who figure out how the car will actually work: improving on the original design of the internal combustion engine, improving performance, and engineering new parts and new techniques that will make the car better, quicker, safer, more powerful or more economical.

So which type of bioinformatician are you?  Engineer, designer, builder, fixer (mechanic) or user?  Oh wait, I forgot, the “user” isn’t a bioinformatician.  So what are they?  Hello “Data Scientist”!!

Excited by bacterial DNA in sweet potatoes? Check this out!

Cool paper – turns out that the genome of the sweet potato has segments of the Agrobacterium genome in it! Genuinely cool stuff.

However, I found something even more exciting!  A tiny virus genome present in huge numbers of sequenced bacteria.  Check it out:

  1. Go here: http://blast.ncbi.nlm.nih.gov/Blast.cgi
  2. Click on Nucleotide Blast
  3. Where it says “Enter accession number, gi or FASTA”, simply enter the number: 9626372
  4. Where it says “Database” make sure you choose “others”
  5. Just under there, where it says “Organism (optional)” start typing the word Bacteria, and choose “Bacteria (taxid: 2)”
  6. Click the Blast button at the bottom

OMG!  This tiny viral genome is in E. coli, it’s in Acinetobacter, Desulfitobacterium, Sphingobacterium….

I’M EMBARRASSED FOR THEM! #SHAME

On publishing software

This post is a response to Titus and Daniel’s blog posts on whether published software should be re-useable, and covers some elements of an argument I had with Aylwyn on Twitter.

What is the purpose of an academic paper?

It’s a good question.  Reading up on the history of academic publishing, it seems that initially academic papers were little more than stories, with “trade secrets” kept by the authors and not revealed.  This did not work very well, and (this is unreferenced!) I like this sentence from the above Wiki page: “The Royal Society was steadfast in its not yet popular belief that science could only move forward through a transparent and open exchange of ideas backed by experimental evidence”.

Adding some of my own thoughts, when describing methods in a paper: is the point of a publication to show that you can do something?  Or is the point of a publication to show everyone else how to do it?

And what do your funders want?  I think I can guess.  Most government funding bodies want to drive the knowledge economy because it stimulates job creation and economic growth.  It also contributes to general health, happiness and well-being.  Sometimes this is best achieved by protecting intellectual property and commercialising discovery; however, most often it is best achieved through the sharing of ideas, knowledge and expertise – the academic publication route.

To put it bluntly – any data and techniques which you choose to keep exclusively for your own group’s use and advantage act as a barrier to discovery, because no matter how good you are with them, you plus hundreds of others would achieve far more.

So – the point of an academic paper is to share ideas, knowledge and expertise.  It’s not an exercise in showing off (“look what I can do!”).

Software: What is the point of your paper?

Titus mentions single-use software and I have some sympathy here.  If the point of your paper is not the software, then documenting, testing, writing tutorials, releasing and supporting can be very onerous and is somewhat of a distraction.  For example, if your paper is about a genome and how members of that species evolved, but to assemble that genome you wrote some scripts that hacked a few different assembly packages together, then you should definitely (i) release the scripts and (ii) describe the method in some detail; but I can see that this is single-use software, not meant for re-use, and it doesn’t need to have all the bells and whistles attached to it.  So I think Titus and I agree on that one.  Release it, mark it up as “not for re-use” and move on.

Publishing a theory paper

This is a grey area.  I’d like to draw an analogy to laboratory methods here.  Let’s say some bright spark has an idea.  Maybe they look at CRISPR and Cas and they think “hey, theoretically, that could be used to edit genomes”.  They write up their theory and publish it.  Fine.  It’s a theory, but without an implementation, pretty low impact.  It’s all a bit Elon Musk.  Then imagine the bright spark instead shows that they can edit a genome using CRISPR-Cas, but the method will only work for one gene in one organism.  We have the theory, and a limited-use implementation.  Again, publishable, but not very high impact.  Finally, imagine the same bright spark has the theory, and an implementation which can be used for most genes in most organisms.  That’s the wow paper right there.

We can apply this to theory papers in computational biology.  There are plenty of journals in which to publish theories.  If you have a theoretical method, publish it there.  The next stage is a theory with a software implementation that is not intended for re-use.  It’s fine, but it’s low impact and should be treated as such.  There are plenty of software tools that work in a single, given situation, but cannot and should not be applied elsewhere.  There are plenty of models that are over-trained and over-fitted.  Single-use software is not necessarily good evidence for your theory.  Finally, the gold standard – theory, and software that is generally applicable to many instances of a problem.  Clearly the latter is the highest impact, but more importantly, the first two cases are (in my opinion) of very limited use to the community.  I’ll give you an example: I have a theory (and many others do too!) that uncertainty in sequence reads should be represented as a graph, and that we should align read graphs to reference graphs to extract maximum information content in genomics.  Shall I just sit back and wait for the phone call from Nature?  I think not.  Ah, but here, I have a Perl script that takes a single read-as-a-graph and aligns it perfectly to the human genome.  Again, Nature?  Maybe Science this time?  Hmmm?

Software/method publication by stealth

The problem with accepting a paper where the authors present a novel method but a software implementation that cannot be re-used is that this is software publication by stealth.  Because the software will be re-used – by the authors.  This is a way in which computational biologists can publish a method without enabling others to easily carry out that method; it’s back to the “trade secrets” publication style of the 1700s.  It’s saying “I want the paper, but I am going to keep the software to myself, so that we have an advantage over everyone else”.  In short, it is the absolute antithesis of what I think a publication is for.

Software publications

It’s clear to me that publications focused on the release of software should of course make the software re-useable.  To jot down a few bullet points:

  1. The source code should be readable
  2. It should compile/run on readily available platforms
  3. The authors should list all dependencies
  4. The authors should state all assumptions the software makes
  5. The authors should state when the software should be used and when it shouldn’t be used
  6. There should be a tutorial with test datasets
  7. The authors should demonstrate use on both simulated and real data
  8. There should be methods by which users can get help
  9. If command line, the software should print out help when run with no parameters
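
For the last point, here’s a trivial sketch of what I mean in R (the script name, file names and options are made up; this is just one way to do it, not a prescription):

#!/usr/bin/env Rscript
# a minimal sketch: print usage and exit if run with no parameters
# (my_tool.R, input.fasta and output.txt are hypothetical names)
args <- commandArgs(trailingOnly=TRUE)
if (length(args) == 0) {
  cat("Usage: my_tool.R <input.fasta> <output.txt>\n")
  cat("  input.fasta   sequences to analyse\n")
  cat("  output.txt    file to write results to\n")
  quit(status=1)
}
infile  <- args[1]
outfile <- args[2]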

What do you think?

Recreating a famous visualisation

Many of you will be aware of the Project Tycho visualisation(s) about vaccination that were featured in the Wall Street Journal – if you’re not, you should be.  They’re beautiful, and almost everything a good visualisation should be:

[Figure: the interactive WSJ/Project Tycho measles visualisation]

The plots shown on the WSJ website are actually interactive and use Javascript, which is a bit beyond the scope of this blog post, but I wanted to show how you can create something similar.

The Data

I can’t give you the data due to T&Cs, but you can download it yourself.  Go to Project Tycho, register, log in, choose level 2 data, click “search and retrieve data”, choose state rather than city, add all of the states to the selection box, untick “graph” and submit the query.  There is then an option to download in Excel.  The Excel file is actually CSV and we can import it into R quite easily.  This is what (a subset of) my data look like in R:

     YEAR WEEK ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT
1096 1930    1       7      0              0       4      196        178       18          64
1097 1930    2      24      0              0       0        2        442       69          62
1098 1930    3      28      0              0       2        9        490       26          44
1099 1930    4      21      0              0       1        3        628       40          23
1100 1930    5      47      0              0       5        7        864      101          25
1101 1930    6      63      0              0       6        6        943      101          24
1102 1930    7     108      0              0       5        5        954       65          20
1103 1930    8     167      0              0       2        9       1151      170          13
1104 1930    9     191      0              0       7       15       1433      150          23
1105 1930   10     242      0              0       5        8       1514      256          39
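
For completeness, the import itself is just a read.csv() call.  A minimal sketch (the filename below is a placeholder for whatever your download is called; stringsAsFactors=FALSE keeps the state columns as plain text, which matters for the numeric conversion further down):

# the "Excel" download is really CSV; the filename is a placeholder
# (you may also need skip= if your file has preamble rows)
d <- read.csv("tycho_measles_states.csv", header=TRUE, stringsAsFactors=FALSE)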

We need to do a few things – limit the data to 1930 onwards, replace the “-” entries with zeros, convert everything to numeric, and sum over weeks:

d[d=="-"] <- 0                                        # replace "-" entries with zero
for (i in 2:61) {
   d[,i] <- as.numeric(d[,i])                         # make sure WEEK and the state columns are numeric
}
d <- d[d$YEAR>=1930,]                                 # keep 1930 onwards
yd <- aggregate(d[,3:61], by=list(year=d[,1]), sum)   # sum weekly counts into yearly totals per state
yd <- yd[order(yd$year),]                             # order by year

My data now look like this:

   year ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE
1  1930    4389      0              0    2107      996      43585    11821        1978      264
2  1931    8934      0              0    2135      849      27807     4787       12869     2428
3  1932     270      0              0      86       99      12618     2376        5701       39
4  1933    1735      0              0    1261     5438      26551      297        5430      205
5  1934   15849      0              0    1022     7222      25650    12946        4659     3063
6  1935    7214      0              0     586     1518      28799    12786       24149     1040
7  1936     572      0              0    2378      107      49092      604        3987     1490
8  1937     620      0              0    3793      322       5107     1027       10181     1796
9  1938   13511      0              0     604     6690      19452     8403        1388      492
10 1939    4381      0              0     479     1724      67180     5182       15817      159

Creating the visualisation

It’s important to note that great visualisations don’t just happen; they take work; they evolve over time; there is trial-and-error until you get what you want.  Don’t be scared of this, embrace it!

I’m going to use the heatmap.2() function from gplots in R.  Let’s look at a basic, raw heatmap (with some things turned off that I always turn off in heatmap.2()):

library(gplots)
heatmap.2(as.matrix(yd[,2:60]), trace="none", key=FALSE)

[Figure: basic raw heatmap]

OK… I don’t actually want to cluster years, nor do I want to cluster states, so I should turn off the dendrograms.  I also have the data the wrong way round, so I need to transpose the matrix:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
          dendrogram="none", trace="none", key=FALSE)

[Figure: transposed raw heatmap]

OK, we start to see a pattern: there are more yellow squares to the left of the diagram.  Yellow in this context means high, and left means early in the period we’re looking at.  Still a terrible heatmap, though.  There is lots of wasted space; there are no year labels; and the state names are being cut off.  Let’s deal with those:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12))

I get an error about the figure margins being too large, but this is quite common and can often be ignored.

[Figure: heatmap with labels added and wasted space removed]

We’re getting somewhere now.  It’s time to think about colours.

Colours

Let’s get something awkward out of the way.  The guys who created this visualisation cheated a little bit.  Let’s look at the colour scale:

[Figure: the WSJ colour scale]

This is not a natural colour scale.  White-through-to-blue is fine, but blue does not naturally flow into yellow and then red.  So this is a false colour scheme which has been designed for maximum impact – low numbers of measles cases get nice colours such as white, blue and green; high numbers get danger colours such as yellow and red.  Is this acceptable?  I’ll let you decide.

In addition, note the WSJ figures have been scaled per 100,000 population.  I do not have the historical population figures for American states going back to 1930, so we are plotting the raw data (raw number of cases) not scaled for population.  So we can’t completely recreate the WSJ figure, but we can get close.

Colours and breaks

In heatmaps, we have the concept of colours and breaks.  Breaks are provided as a vector of numbers, and essentially, we use each colour to represent values that fall between each subsequent pair of breaks.  In other words, the colours fit “between” pairs of breaks.  Perhaps best explained by example.  Let’s say our breaks vector is 0, 100, 200, and our colours are “green” and “red”.  Therefore, any value between 0 and 100 will be green in the heatmap, and any value between 100 and 200 will be red.   As colours fit in the middle between breaks, there should be one less colour than there are breaks.  I often find it useful to set my own breaks and colours in heatmaps – the function can decide for you, but how do you know what assumptions it is making?
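
A quick toy check of that example, using cut() to bin values between the same breaks (this is an aside, not part of the heatmap code):

# values in (0,100] map to the first colour, values in (100,200] to the second
vals <- c(10, 99, 101, 200)
cut(vals, breaks=c(0,100,200), labels=c("green","red"))
# [1] green green red   red

This is also why there is one less colour (two) than there are breaks (three).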

Measles

If I want to recreate the colours of the WSJ plot, I need low values between white and blue, and medium-to-high values between yellow and red.  My minimum and maximum values are 0 and 126221:

min(yd[,2:60])
max(yd[,2:60])

I’m going to use colorRampPalette() to create ten colours between white and blue; then 30 colours between yellow and red.  I’m also going to game the breaks, just so I can get something similar to the WSJ look:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(30))

I now need to set my breaks.  I want white to be zero, so my first two breaks should be 0 and 0 (i.e. only values between 0 and 0 will be white; in other words, only 0 will be white).  I’m then going to scale up to 700 in steps of 100; and the rest I am going to split equally between 800 and the maximum value (which I’ll set at 127000).

bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,length=32))

NOTE: this is a bit dodgy.  Really, my bins should be the same size, but it’s the only way I could recreate the WSJ look with this data.

Now the heatmap:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks)

[Figure: heatmap with the WSJ-style colour scheme]

OK!  Now we’re getting somewhere!  The WSJ heatmap has gaps between each row and between each column, so let’s add those in:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white")

[Figure: heatmap with white separators between cells]

We want a line at 1961, which, after a bit of trial and error, turns out to be at x=32:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2))

[Figure: heatmap with the vertical line added]

Now let’s tidy up those year labels:

row.labels <- rep("", 72)
row.labels[c(1,11,21,31,41,51,61,71)] <- c("1930","1940","1950","1960","1970",
                                           "1980","1990","2000")
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2))

[Figure: heatmap with tidied year labels]

We really should add a title, and we’ll need to go back and make room for it using the lhei parameter:

par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2),
        main='Measles cases in US states 1930-2001\nVaccine introduced 1961')

[Figure: heatmap with title added]

And there we have it!  I think this is about as close to the WSJ visualisation as I can get!

The only thing I would say is that I am disappointed that I had to game the break points to get here.  It doesn’t sit well with me.  However, it was the only way I could figure out how to get so much blue into the left hand side of the graph, whilst still keeping the right side white.

The simple fact is that if I keep the break points a similar size (100), the graph looks better:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(1261))
bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,by=100))
par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2),
        main='Measles cases in US states 1930-2001\nVaccine introduced 1961\n(data from Project Tycho)')

[Figure: final heatmap with equal-sized breaks]

So, there we are :-)  A final (larger) version can be found here.

You probably don’t understand heatmaps

If you work in any area of quantitative biology, and especially if you work with transcriptomic data, then you are probably familiar with heatmaps.  They have been used for as long as I have been in research: these figures cluster the rows and columns of a data matrix, and show both dendrograms alongside a colour-scaled representation of the data matrix itself.

See an example from one of my publications below:

Pretty simple huh?  Well actually, no, they’re not, and unless you’re a statistician or bioinformatician, you probably don’t understand how they work ;-)

There are two complexities to heatmaps – first, how the clustering itself works (i.e. how the trees are calculated and drawn); and second, how the data matrix is converted into a colour-scale image.

Clustering data

I don’t really have time to explain cluster analysis, which actually refers to a huge range of methods.  I can explain the simplest method though, which is hierarchical, agglomerative cluster analysis.  In a nutshell, this works by first calculating the pairwise distance between all data points; it then joins the two data points that are least distant from each other; then it joins the next least distant pair of points; and so on until it has joined all points.  The tree is a graphical representation of this process.  At some point the process needs to join groups of points together, and again there are many methods, but one of the most common is to calculate the average pairwise distance between all pairs of points in the two groups.
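
As a toy illustration of that last step (“average” linkage), here is a little arithmetic in R (not part of any heatmap code, just a made-up example):

# the distance between two groups is the mean of all pairwise
# distances between their members (here, four 1-D points)
g1 <- c(1, 2)
g2 <- c(10, 14)
mean(abs(outer(g1, g2, "-")))   # (9 + 13 + 8 + 12) / 4 = 10.5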

Know your distance measure

Put simply, the distance measure is how different two data points are.  It is the counterpart of a similarity measure, which measures how similar two data points are.

So how do we calculate distance?  With the default methods for both the heatmap() and heatmap.2() functions in R, the distance measure is calculated using the dist() function, whose own default is euclidean distance.  This measures the absolute distance between the points in space, and quite importantly, pays no attention to the “shape” of the “curve”.  To explain this, let’s use an example.  Imagine we have measured the gene expression of 4 genes over 8 time points:

# two highly expressed genes...
h1 <- c(10,20,10,20,10,20,10,20)
h2 <- c(20,10,20,10,20,10,20,10)

# ...and two lowly expressed genes
l1 <- c(1,3,1,3,1,3,1,3)
l2 <- c(3,1,3,1,3,1,3,1)

mat <- rbind(h1,h2,l1,l2)

# plot the four expression profiles
par(mar=c(4,4,1,1))
plot(1:8, rep(0,8), ylim=c(0,35), pch="", xlab="Time", ylab="Gene Expression")

for (i in 1:nrow(mat)) {
  lines(1:8, mat[i,], lwd=3, col=i)
}

legend(1, 35, rownames(mat), 1:4, cex=0.7)

[Figure: expression profiles of the four genes over time]

So we have two highly expressed genes and two lowly expressed genes.  Crucially for this example, the two pairs of genes (high and low) have very different shapes.

If we input this data into the dist() function, we get a distance matrix:

dist(mat)

          h1        h2        l1
h2 28.284271                    
l1 38.470768 40.496913          
l2 40.496913 38.470768  5.656854

Here we can see that the least distant are l1 and l2 (distance = 5.65), so they will be clustered together; the next least distant are h1 and h2 (distance = 28.28), so these will be clustered together next.  Then finally the two groups will be joined.  This is borne out by a naive cluster analysis on the distance matrix:

hc <- hclust(dist(mat))
plot(hc)

[Figure: dendrogram from the naive cluster analysis]

And a naive heatmap (I’ve turned off the column tree as, in gene expression profiling over time, we generally want the time points to be in the correct, original order):

library(gplots)   # for the greenred() colour palette
heatmap(mat, Colv=NA, col=greenred(10))

There are multiple things going on here, so let’s take this one thing at a time.  The clustering has worked exactly as it was supposed to – by distance, l1 and l2 are the most similar so they are grouped together; then h1 and h2, so they are grouped together.  But the heatmap looks terrible, the colours are all wrong.  Why?  Well, despite l1 and l2 being clustered together, their colours do not follow the same pattern; same goes for h1 and h2.  Also, l1 and h1 have the same colours despite having VASTLY different expression profiles; same for l2 and h2.

Understand scaling

Think about the data, and then think about the colours in the heatmap above.  Data points l2 and h2 have exactly the same colours, as do l1 and h1 – yet they have very different values.  That’s because the data points are scaled prior to being converted to colour, and the default in both heatmap() and heatmap.2() is to scale by row (other options are to scale by column, or no scaling at all).  Scaling by row means that each row of the data matrix is taken in turn and given to the scale() function; the scaled data are then converted into colours.
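
A quick toy check of this (an aside, not part of the original analysis): scale() works column-wise, so scaling the rows of our matrix means transposing, scaling, and transposing back.  After row-scaling, h1 and l1 become identical, as do h2 and l2, which is exactly why they end up with the same colours:

# row-scale the toy matrix: every row becomes an alternating
# pattern of -0.935 and 0.935, so h1 matches l1 and h2 matches l2
round(t(scale(t(mat))), 3)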

Let’s turn off scaling and see what happens.  Here is the heatmap clustered by euclidean distance with scaling turned off:

heatmap(mat, Colv=NA, col=greenred(10), scale="none")

[Figure: heatmap with scaling turned off]

Well, this looks slightly better, but still not great!  l1 and l2 are at least both green, and h1 and h2 are at least both red/black (though they still oppose one another).

What’s happening here?  Well, think about a heatmap and what green, red and black mean to you.  Green usually refers to low; red usually refers to high; and black is somewhere in the middle.  Well, without scaling, both l1 and l2 are “low”, so they get given green colours;  the highest points of h1 and h2 are “high”, so they get red; the low points of h1 and h2 are in the middle, so they get black.

Improving things

Is any of this what you would expect?  The answer, I think, is probably no.  Usually, in gene expression profiling, we want to cluster together genes that have a similar profile, or similar shape, over time. When we apply a colour scale, as we do in a heatmap, we give low values green, high values red, and middle values black.  Of course some genes are highly expressed, and some lowly expressed; do we want to give all highly expressed genes the colour “red”, and all lowly expressed genes the colour “green”, regardless of shape?  Well, sometimes we might, but most often we don’t.  This is why the heatmap and heatmap.2 defaults are quite strange to us – they both scale the data by default, which is great if you want to cluster together data points with a similar shape; but they use euclidean distance, which is not what you want to use to cluster points by shape.

How do we fix this?  From the gene expression profiles, we know that h1 and l1 have a similar shape, and h2 and l2 have a similar shape, but dist() doesn’t care about shape, it only cares about absolute distance.

How do we cluster on “shape”?  Well, we need to use a different distance measure.  To do this, we actually start with a similarity measure, the Pearson correlation coefficient.  Without going into too much detail, the Pearson correlation coefficient produces a value between -1 and 1; a value of 1 signifies that the shapes of two profiles are identical; a value of -1 signifies that the shapes of two profiles are exactly opposite; and values in between represent profiles that are more or less similar.

The correlation matrix of the above data looks like this:

cor(t(mat))

   h1 h2 l1 l2
h1  1 -1  1 -1
h2 -1  1 -1  1
l1  1 -1  1 -1
l2 -1  1 -1  1

We can see that h1 and h2 have a correlation coefficient of -1 and therefore are very dissimilar; same for l1 and l2.  However, h1 and l1, and h2 and l2, are perfectly positively correlated!  This is what we expect!  However, for clustering (and heatmaps) we need a distance measure, not a similarity measure, so we need to subtract all of these values from 1, which gives:

1 - cor(t(mat))

   h1 h2 l1 l2
h1  0  2  0  2
h2  2  0  2  0
l1  0  2  0  2
l2  2  0  2  0

Here the distance between h1 and l1 is 0; the distance between h2 and l2 is zero; the distance between h1 and h2 is 2 and the distance between l1 and l2 is also 2.

We can draw a naive cluster analysis of this data:

hc <- hclust(as.dist(1-cor(t(mat))))
plot(hc)

[Figure: dendrogram from clustering on correlation distance]

And a simple heatmap:

heatmap(mat, Rowv=as.dendrogram(hc), Colv=NA, col=greenred(10))

This is what we want – genes clustered on the shape of their expression profile; l1 and h1 clustered together, and they have the same colours; and l2 and h2 clustered together, and they have the same colours. I mean, it still looks awful, but that’s because we used a silly example!

A proper heatmap

This example has been on the ARK-Genomics website for some time:

library(gplots)

# read the data in from URL
bots <- read.table(url("http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt"), sep="\t", header=TRUE)

# get just the alpha data
abot <- bots[,c(8:25)]
rownames(abot) <- bots[,1]
abot[1:7,]

# get rid of NAs
abot[is.na(abot)] <- 0

# we need to find a way of reducing the data.
# Sort on max difference and take first 1000
min <- apply(abot, 1, min)
max <- apply(abot, 1, max)
sabot <- abot[order(max - min, decreasing=TRUE),][1:1000,]

# cluster on correlation
hc <- hclust(as.dist(1 - cor(t(sabot))), method="average")

# draw a heatmap
heatmap(as.matrix(sabot),
        Rowv=as.dendrogram(hc),
        Colv=NA,
        col=greenred(10),
        labRow="")

[Figure: heatmap of the 1000 most variable genes, clustered on correlation]

I hope by now you understand that heatmaps are quite complex visualisations; there are actually quite a few more complexities, but more on that in another post!

The cost of sequencing is still going down

There’s been a bit of chatter on Twitter about how the cost of sequencing is not going down anymore, with reference to that genome.gov graph.  I realise that sensationalist statements get more retweets, but I am sorry, this one is complete crap – the cost of sequencing is still on a long-term downward trend.

By way of qualification, I have been helping to run our University’s next-generation genomics facility, Edinburgh Genomics, since 2010.  We are an academic facility running an FEC model – which means we recover all of our costs (reagents, staff, lab space etc) but do not make a profit.  If you are interested in Illumina sequencing, especially after reading below, you should contact us.

You can read some relatively up-to-date references here, here and here.

What I am going to talk about below are the medium- to long- term trends in sequencing prices on the Illumina platform.  There are fluctuations in the price in any given period (for example Illumina put prices up across the board a few weeks ago), but these fluctuations are very small in the context of the wider, global trend of cost reduction.

HiSeq V3

Back in 2013/14, the most cost-effective method of sequencing genomes was on the HiSeq 2000/2500 using V3 chemistry.  At its best, a lane of sequencing would produce 180 million paired 100bp reads, or 36 gigabases of sequence data.  I am going to use this as our baseline.

HiSeq V4

After HiSeq V3 came HiSeq V4, which was introduced last year.  All new 2500s could offer this and many new-ish 2500s could be upgraded.  At its best, V4 produces 250 million paired 125bp reads, or 62.5 gigabases of sequence data.

Of course, Illumina charge more for V4 reagents than they do for V3 reagents, but crucially, the price increase is proportionally smaller than the increase in throughput.  So, at Edinburgh Genomics, the cost of a V4 lane was approx. 1.2 times the cost of a V3 lane, but you get 1.7 times the data.  Therefore, the cost of sequencing decreased.

HiSeq X

This is rather trivial I think!  By my calculations, a lane of the HiSeq X will produce around 110 gigabases of sequence data, which is 3 times as much data as HiSeq V3, and the cost has come down to about 0.4 times.  Therefore, the cost of sequencing decreased.

HiSeq 4000

The HiSeq 4000 is a bit of an unknown quantity at present as we haven’t seen one in the wild (yet) and nor have we come up with a detailed costing model.  However, my back-of-a-post-it calculations tell me the HiSeq 4000 lanes will produce around 93 gigabases of data (about 2.6 times as much data as HiSeq V3) at about 1.4 times the cost.   Therefore, the cost of sequencing decreased.
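
To make the arithmetic explicit, here is a quick back-of-an-envelope sketch in R, using nothing but the rough lane-cost and yield multipliers quoted above (all relative to a HiSeq V3 lane):

# relative cost per gigabase = relative lane cost / relative yield
rel_lane_cost <- c(V3=1.0, V4=1.2, HiSeqX=0.4, HiSeq4000=1.4)
rel_yield     <- c(V3=1.0, V4=1.7, HiSeqX=3.0, HiSeq4000=2.6)
round(rel_lane_cost / rel_yield, 2)
# gives approximately: V3 1.00, V4 0.71, HiSeqX 0.13, HiSeq4000 0.54

However you cut it, the cost per gigabase keeps falling.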

Drawing a new graph

Here it is.  You’re welcome.

[Figure: updated graph of sequencing cost over time]

My numbers are accurate, despite what you may hear

It came to my attention last year that some facilities might be offering HiSeq V3 lanes with over 200 million read-pairs/clusters.  It is possible to go that high, but often only through a process known as over-clustering.  This is a bad thing.  It not only reduces sequence quality, but also produces lots of PCR and optical duplicates which can negatively affect downstream analysis.

Other platforms

I haven’t spoken about Ion Torrent or Ion Proton for obvious reasons.  I also haven’t included the NextSeq 500 or MiSeq – to be honest, though these are very popular, they are not cost-effective ways of producing sequence data (even Illumina will admit that) and if you told your director that they were, well, shame on you ;-)

PacBio?  Well, it seems the throughput has increased 3 times in the last year, despite the need for an expensive concrete block.  So I can’t really see the cost of PacBio sequencing going up either.

MinION and PromethION – costs currently unknown, but very promising platforms and likely to bring the cost of sequencing down further.

Complete Genomics – well, as I said in 2013, they claimed to be able to increase throughput by 30 times.

There is also the BGISEQ-1000, which apparently can do 1 million human genomes per year. Apparently.

All of which means – the cost of sequencing is going to keep coming down.

So why is the genome.gov graph incorrect?

I don’t know for sure, but I have an idea.  Firstly, the data only go to July 2014; and secondly, the cost per genome is listed as $4905, which is obviously crap in the era of HiSeq X.

Can we stop this now?