An embargo on short read alignment tools

Edit: 13th August 2013

Unbelievable!  After all this time some people are still producing short-read aligners!  And I have to say, this one is the slowest I have ever tried!  It does have the best user interface though!

Gamers to join Ash Dieback fight


Two things happened recently that inspired this blog post. The first was an excellent review that revealed that there are now over 70 short read alignment tools available (it should be noted that another list exists at Wikipedia, which has some entries that the EBI list does not; and let’s not forget the infamous SeqAnswers thread).  The second was the publication of another (probably excellent) short read alignment tool in the journal Bioinformatics.  I’d have called it YANAT, but the authors decided not to for some reason...

I can’t help but say – I’m sorry, but isn’t this a waste of time, both yours and mine?

I have nothing against the authors of the new tool, who I am sure are excellent scientists. By some miracle, they might read this blog and comment, or email me, and tell me I’m being unkind. I’ll feel bad. But still, rather than write another tool, why not contribute to the codebase of an existing tool? If BWA is not accurate enough for you, then branch the code and make it so; if Stampy is too slow, speed it up.

Now, I’m well aware of the “bioinformatics process”, and most of the time it works fine.  It often starts with a new problem, or technology.  An initial tool is published that can deal with the data.  Then a raft of new tools are published which improve on the original work, or fill a slightly different niche.  There is then a “survival of the fittest” process, and the best survive to form best practice for analyzing that particular type of data.

I presented this paradigm at the Eagle Genomics Symposium 2012 when I introduced the “Watson Square” of bioinformatics research:


Here in red we have the original and first tool published to tackle the problem;  then we either improve on this tool by getting more knowledge from the data (x-axis: biology) or by getting similar results quicker or using less memory etc (y-axis: technology).  The holy grail, of course, is to improve the efficiency whilst also extracting more biological knowledge.

What I don’t understand is how, with over 70 short-read aligners out there, you can publish a new one that you can show is better than all of the existing tools.  And why would you bother?

The excuse I most often hear is that there is no incentive to contribute to an existing, already published, codebase.  As an academic, it is publish or perish: if you write a new tool you will get a paper out of it; if you contribute to an existing tool, at best your name will be lost in a long list of authors, and at worst you won’t get published at all.

However, this argument is a complete fallacy.  Take Velvet as an example, one of the first de Bruijn graph assembly tools, which launched Dan Zerbino into bioinformatics superstardom back in 2008.  Velvet has proven to be an excellent starting point for many others:  Namiki et al have extended the code to work on metagenomes; Torsten Seemann’s group have written an essential wrapper, the Velvet Optimiser, and published Vague, a GUI for Velvet; Matthias Haimel is developing Curtain, which is another wrapper that allows users to add in read-pair information to improve assemblies;  and Daniel himself, and others, have developed additional algorithms and published those, including Pebble/Rockband, and Oases.

In fact, any bioinformatics codebase can be seen as a great coral reef, for many others to feed off, and to create an entire ecosystem of tools and extensions.  Surely it’s better that way, rather than swimming off on your own and trying to establish another, virtually identical reef a few miles away?

Oh, and in answer to the burning question you all have, I use Novoalign.  Why?  Because I value accuracy over speed, because it has some really nice little features, because I can do alignment in a single command and because it has excellent support.

So come on guys – surely now we can have an embargo on the development and publication of short-read mappers?  Please?  We have enough.  In fact, we had enough when there were 20, never mind 70+.  Do yourself, and everyone else, a favour.  Stop.  If you’re short of things to do, why not try writing something that can align/assemble 10-100kb reads instead?

30 thoughts on “An embargo on short read alignment tools”

  1. Fabien Campagne

    I have not read the paper, but if the abstract and discussion make it clear what contributions the article makes to the field, why would the work not be published?

    However, I do agree with you that it would have been much preferable for the authors to extend an existing aligner if this was possible. In fact, it would be useful if authors could be encouraged to write a paragraph where they explain/justify why they could not reuse/extend existing software and had to develop something from scratch. There can sometimes be good reasons not to extend/reuse, but I think it would really help bioinformatics code reuse if these reasons were made explicit in manuscripts that present new methods.

    Regarding alignment of 10-100KB contigs, check out the LAST aligner, developed and optimized for comparison of full mammalian genomes. The code is quite readable, so it would be a good start to adapt if you need something along these lines.

    1. biomickwatson

      I don’t want to stop anyone publishing, we’re all in the same boat, and we all need papers. I guess I’m just thinking the time and money spent developing a large number of those aligners could have been better spent elsewhere.

      Here’s a question. You’re a reviewer. You get to review a grant, and they want to develop a new short read aligner. The cost is £400k over 3 years. What would you say?

  2. Luis Pedro Coelho (@luispedrocoelho)

    “However, this argument is a complete fallacy.”

    Actually, it’s not. It may not be always true, but it’s not a complete fallacy. Your counter-examples (fewer than the 70 positive examples, I’ll note) are of major extensions of the tools, which is not where the process is the most inefficient.

    Let’s say I’m reading the paper for ALIGNER and I think “Oh, in step 3, they could have used ALGORITHM for a major speed-up.” I can (1) get their source code, spend a few weeks coding up ALGORITHM and send them back the contribution or (2) spend a few months reimplementing everything from scratch, except that step 3 is now ALGORITHM.

    With (1) I’ll get a mention on the Release notes for the next version of ALIGNER (if it is even maintained), with (2) a paper which I may even get cited a couple of times if I mention the tool at conferences enough. Of course, the social optimum is that more people do (1), but the incentives are that people do (2).

    1. biomickwatson Post author

      My point being that if you significantly speed up an existing algorithm, you’ll be an author on the next paper. Many bioinformatics tools have multiple publications for multiple versions – see bowtie and bowtie2 publications.

      I agree incentives are skewed towards new tools, but not as much as everyone thinks, and there are other forces at work here.

      1. Torsten Seemann

        The problem is that first and last author is what counts. And being a guest contributor to version two is unlikely to get those coveted author positions. But there is no reason the contributors can’t FORK the original as you suggested, and publish that. Then the original authors can merge the branch. Win win?

      2. Luis Pedro Coelho (@luispedrocoelho)

        Not sure that a code contributor would be made into an author. Arguably should be, but no tradition of it (bowtie2 paper had two authors).

        I don’t know about bowtie (although I use it), but most tools are not developed as open-source tools: just discrete releases on publication.

        So your upgrade for version 1 might not even make it into version 2 if that was already being prepared as a completely independent (non-public) branch. I’d love for scientific code to be more like true open source, but we’re not there yet (if we’ll ever be).


        As for forking and publishing: I think it would get rejected as either a minor advance or semi-plagiarism.

      3. biomickwatson Post author

        I appreciate there may be some barriers, but I think if you branch code and make significant improvements (and to get published, they *should* be significant, shouldn’t they?) then you will get published, either with or without the original authors – though of course they should be engaged in the process.

        And if you only make minor improvements, then that may or may not be worthy of authorship.

        I don’t think reviewers would reject a significant advance of an existing codebase as plagiarism – they shouldn’t. They may be curious why the original authors are not involved.

  3. Christopher Hogue

    Thanks for a great post.

    This problem is endemic to all classes of Bioinformatics software, going back to over a hundred different protein secondary structure prediction tools you could find in the mid 1990s.

    I wrote about it extensively in this 2001 article, Implementation or Algorithm? which I am happy to find is online and free – at Drug Discovery World.

    This whole problem of “papers” motivating software has been part of my 3rd year bioinformatics course for some time now.

    It is a false and wasteful economy to value a paper over the software it describes, but will the madness never end?

    Cheers, from the Previous Generation.
    Christopher Hogue

  4. francis

    I agree with Christopher Hogue. One problem, which I think is resolvable, is the lack of a metric and credit for tool usage and for tool/algorithm contribution. The simplistic metric of impact factor, and the paper-writing frenzy that our lazy peer-reviewing system (for promotion, tenure, grants and other papers) rewards, will kill innovation and the different ways many of us contribute to the scientific endeavor … it is a sad state of affairs!

    1. morungos

      So it could be possible to change the metric. I’m a refugee from information retrieval, where the TREC annual contests drive competition in a different way. First, it provides standardized assessments, which is essential for comparison. This isn’t entirely imposed, but there is a discussion process among participants in each track. Second, certain factors, such as runtime performance, become partly irrelevant (as they probably should) by providing a time limit within which results should be submitted. Performance within that window doesn’t matter, failing to complete does. So, unacceptably slow is out, everything else is in, and is compared by some community-agreed quality score. Third, it actually encourages approaches based on patching and forking. A good result from one system can be used by a different team. In TREC, it was common for the same basic algorithm to be used with different weighting and wrapped processing by five or so different teams. Often, the same underlying software was used. Incidentally, it also encouraged opening the code so that it could be adapted by different teams. The result is convergence on a set of high quality technologies. Nothing actually prevented developing a new one from scratch, but it had to be very good to compete on a large-scale community-agreed problem. Finally, the community and the main interest groups agreed that TREC was the dominant forum for evaluating techniques. There are many groups that use aligners who could come together to establish a contest like this. After all, we need to find out which is best for our purposes, and the fragmented literature is unhelpful to us too.
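
      As a rough illustration of the scoring rule described above (a hard time limit, then comparison on quality alone), here is a minimal Python sketch. The tool names, runtimes, scores and the 48-hour limit are all made up for the example; they do not come from TREC or from any real aligner benchmark.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    tool: str             # hypothetical aligner name
    runtime_hours: float  # wall-clock time taken to produce results
    quality: float        # community-agreed quality score, higher is better


def rank_submissions(submissions, time_limit_hours):
    """Drop runs that missed the time limit, then rank the rest by quality only."""
    on_time = [s for s in submissions if s.runtime_hours <= time_limit_hours]
    # Runtime inside the window is deliberately ignored when ranking.
    return sorted(on_time, key=lambda s: s.quality, reverse=True)


# Made-up entries: aligner_c has the best score but misses the deadline.
submissions = [
    Submission("aligner_a", 2.0, 0.91),
    Submission("aligner_b", 20.0, 0.95),
    Submission("aligner_c", 80.0, 0.99),
]

ranking = [s.tool for s in rank_submissions(submissions, time_limit_hours=48.0)]
print(ranking)
```

      In this toy run the slow but on-time tool outranks the fast one, and the tool that fails to finish is simply excluded, which is the behaviour the time-window rule is meant to produce.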

  5. Ian Korf

    The problem runs deeper than chasing papers. I had a reviewer criticize my last grant because “anyone who had the sophistication to use [my] library could write it themselves”.

  6. Pingback: Links 1/2/13 | Mike the Mad Biologist

  7. Jim Lund

    The short read alignment problem is not that old–it was reasonable to work up a new one in 2009-2010. I don’t know the history of this paper and the authors, but if this program was developed and essentially complete in 2010, then used in a few projects, 2012 for publication is not a huge delay.

    Perhaps it was written by a grad student who is now finishing up. Or the study that motivated the development of the new program was published first, and now the coder is writing up the software. Or perhaps users are asking about how to cite it, motivating the author to get it published.

    1. biomickwatson Post author

      Indeed, and I agree, there may be specific historical issues about this work that mean it comes to the party late, but is still worth publishing.

      I was trying as best as I can not to single out the individual paper – do you think all 70+ algorithms have a similar, historical reason for being published? Do you think the next 20 will also have them?

      I don’t want to cheat anyone out of a publication they no doubt deserve because of the hard work they’ve done. I just want the community to come together and decide that we’re done with new short read aligners, and that time and money are better spent either improving existing aligners or developing solutions to new problems.

      1. Jim Lund

        I can’t disagree with you, I’m sure some of these programs were bad ideas when they were started. The long project-to-publication cycle makes it hard to draw a line. Also, the criterion for publication is something a bit novel, not something substantially better than the current popular tools, which is a bar pretty easy to pass. And raising the publication bar will mean fewer projects started, and projects rushed to publication before the door closes, so fewer second or third generation tools. After all, it is only after a dozen or two papers on tools that aren’t very interesting come out that people in the field begin to think the problem is solved well enough.

  8. Pingback: NextGenSeek’s Stories This Week (03/01/13)

  9. Pingback: Clinical Findings from Sequencing keep flowing… | Kromozome

  10. Daniel Zerbino

    “if you contribute to an existing tool, at best your name will be lost in a long list of authors, and at worst you won’t get published at all. However, this argument is a complete fallacy. Take Velvet as an example, [...]”

    Partly true, but you have a sampling bias: you never heard of those who did not get published! For example, Sylvain Forêt spent a *lot* of time developing the multithreading of Velvet. We submitted a small, modest manuscript on this work to a number of journals (Bioinformatics, NAR, maybe a third one, I forget), and we kept on butting against the same argument: this is a technical increment, not a scientific advance (we did not get sent out to reviewers).

    In other words, writing good code is assumed to be equivalent to sweeping your lab, you do it for your own good, no need to publish it. Except that other people might use your code, not your labspace…

    At the risk of repeating what is discussed above, there is the underlying debate of creating incentives for people to adhere to good coding practice so that others can build upon it.

      1. Daniel Zerbino

        I double checked and we did not submit to NAR (we were too discouraged after 2 identical refusals) but to BMC Bioinformatics. We did not submit to PLoS ONE. It was a while back though (>2 years), it’s too bad we did not think about it.

        However, I’m not too sure how a reviewer (or myself) would react to reviewing such technical changes. The journal/review process is simply not adapted to this type of work.

        Here’s a random idea: it would be great if after you published a main manuscript, you could add small addendums (in the same way that you can already respond to reviewers or send an erratum). After all, scientific projects are not meant to grind to a halt once the paper is accepted. That way, Sylvain could have been added as co-author to the main Velvet paper.

  11. Pingback: An embargo on marginal improvements | Scientific B-sides

  12. Bob Harris

    The original post appears to assume that all these short read aligners are attacking the same problem. That all short read alignment problems are the same and that aligners are interchangeable. In my experience, nearly every alignment problem is different.

  13. Bob Harris

    One variation is the divergence between sample and reference sequence. For many people these are the same species. But the sample may be from a species for which there is no reference genome and the best reference is fairly diverged. For example if the sample is a groundhog, what is the nearest reference? An algorithm that expects only 1 or 2 mismatches in a 100 bp read, and one that assumes sequencing errors are more frequent than mutations will probably not perform well when we expect 10 to 15 mutations.

    Another variation is whether your interest is in the reads that aren’t from the reference, as opposed to the reads that are. If you’re looking for SNPs you probably want the aligner to discard reads you can’t confidently map. If you’re looking for stuff the subject has that the reference does not, you probably want only the stuff that confidently *doesn’t* map. Different sensitivity/specificity thresholds, and conceivably the best algorithm for one isn’t the best for the other.
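
    To make the two use cases above concrete, here is a small Python sketch that partitions reads by mapping confidence. It assumes each read has already been reduced to a (read_id, mapq) pair, with mapq set to None for unmapped reads; the threshold of 30 and the read names are arbitrary choices for illustration, and treating "unmapped" as "confidently doesn't map" is a simplification.

```python
def split_by_confidence(reads, min_mapq=30):
    """Partition (read_id, mapq) pairs into confident, ambiguous and unmapped.

    mapq is None for reads the aligner could not place at all.
    """
    confident, ambiguous, unmapped = [], [], []
    for read_id, mapq in reads:
        if mapq is None:
            unmapped.append(read_id)
        elif mapq >= min_mapq:
            confident.append(read_id)
        else:
            ambiguous.append(read_id)
    return confident, ambiguous, unmapped


# Made-up reads: two confident hits, one ambiguous hit, one unmapped read.
reads = [("r1", 60), ("r2", 3), ("r3", None), ("r4", 42)]
confident, ambiguous, unmapped = split_by_confidence(reads)

snp_input = confident      # SNP calling: keep only confidently mapped reads
novel_candidates = unmapped  # novel sequence: keep only reads that did not map
print(snp_input, ambiguous, novel_candidates)
```

    The same alignment output feeds both analyses, but each keeps a different bin and would likely want a different threshold, which is why one set of aligner defaults may not suit both.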

  14. Pingback: Heliconius Homepage » Blog Archive » Aligning Heliconius short read sequences

  15. Pingback: An embargo on short read alignment tools | Eagle Genomics

  16. Attila Berces

    Let me provide a different perspective on the proliferation of NGS tools. The development of YANAT is often driven by the need of the author to solve an intellectual challenge. They do not apply for a grant, they simply allocate their time. My original training is in quantum chemistry, and at one point several dozen quantum chemistry methods existed, and twenty years after the first method was developed graduate students still started their own new projects. Over time people stopped developing new quantum chemistry methods, but it took a long time.
    However, there are practical reasons why someone can start the development of new methods. Our reason to develop an aligner was to find BRCA mutations, where insertions and deletions are disproportionately more prevalent and can be a few hundred nucleotides long. The most successful Burrows-Wheeler methods of the time did not even try to find such variants by design. By the time we were ready, some methods had appeared that could have been good starting points for our development.
    More recently, we developed yet another method, because we wanted to analyze human leukocyte antigens. These genes are associated with over a hundred diseases. They represent the largest human genetic diversity. They include the most polymorphic exons in the human genome, but at the same time the region includes large segmental duplications/multiplications. In order to interpret the biology we need phase resolution of the alleles. The different haplotypes have highly different structure, with several genes and pseudogenes missing from certain haplotypes. In spite of the importance of HLA, none of the referenced seventy alignment tools can give reliable results for these genes. Many association studies simply exclude this otherwise highly interesting region since the results are so unreliable. Although there are open source methods to solve the problem, they had some fundamental conceptual problems. We developed one from scratch which has the combination of accuracy, speed and agility that we could not have reached starting with an existing solution.
    Application of NGS to individuals, or to diagnostics, is an outstanding problem. In association studies or other comparative studies a small false discovery rate is acceptable; for making individual medical decisions it is not. A single nucleotide mismatch in the antigen presenting exons between donor and bone marrow transplant recipient increases mortality rate by 10%. A single SNP error in the relevant thousand nucleotides could translate into over 1000 deaths and $200M wasted healthcare expenditure in the US annually. Application of any of the seventy alignment tools and making the decision on NGS results could lead to half the patients dying. The current alignment and variant calling algorithms are simply not good enough to make certain medical decisions.
    Genomic regions are diverse in haplotype structure, GC content, rate of polymorphisms, and the distribution of variant types. The goals of a study are also diverse: whether we are looking for known mutations or any mutation, and what type of mutations we are after. For these reasons, I predict that we shall see more specialized methods to solve particular problems. These specialized methods will solve specific medical or diagnostic problems better than any combination of the current seventy alignment algorithms and variant callers.

  17. Pingback: Cats, dogs and bioinformatics | opiniomics
