Whenever I see bad bioinformatics, a little bit of me dies inside, because I know there is ultimately no reason for it to have happened. As a community, bioinformaticians are wonderfully open, collaborative and helpful. I genuinely believe that most problems can be fixed by appealing to a local expert or the wider community. As Nick and I pointed out in our article, help is out there in the form of SeqAnswers and Biostars.
I die even more when I find that some poor soul has been doing bad bioinformatics for a long time. This most often happens with isolated bioinformatics staff, often young, and either on their own or completely embedded within a wet-lab research group. I wrote a blog post about pet bioinformaticians in 2013 and much of the advice I gave there still stands today.
There are so many aspects of bioinformatics that many wet lab PIs are simply incapable of managing them. This isn’t a criticism. There are very few superheroes; few people can actually span the gap between wet and dry science competently. So I thought I’d write down the 5 bad habits of bad bioinformaticians. If you are a pet bioinformatician, read these and figure out if you’re doing them. If you’re a wet lab PI managing bioinformatics staff, try to find out if they have any of these habits!
Using inappropriate software
This can take many forms, and the most common is out-of-date software. Perhaps they still use Maq, when BWA and other tools are now far more accurate; perhaps the software is a major version out of date (e.g. Bowtie instead of Bowtie2). New releases come out for a reason, and the main reasons are (i) new or better functionality and (ii) bug fixes for old code. If you run out-of-date software, you are probably introducing errors into your workflow, and you may be missing out on more accurate methods.
Another major cause of inappropriate software use is that people often use the software they can install rather than the software that is best for the job. Sometimes this is a good reason, but most often it isn’t. It is worth persevering – if the community says that a tool is the correct one to use, don’t give up on it because it won’t install.
Finally, there are just simple mistakes: using a non-spliced aligner for RNA-Seq, using software that assumes the wrong statistical model (e.g. techniques that assume a normal distribution applied to count data), and so on.
These aren’t minor annoyances. They can embed serious errors in your data, and that can result in bad science being published. Systematic errors in analysis pipelines can often look like real results.
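To make the statistical-model point a little more concrete, here is a minimal sketch of a sanity check you could run on count data before reaching for a method that assumes a normal distribution. It computes the variance-to-mean ratio per feature across replicates; a ratio well above 1 means the variance is growing with the mean, which is typical of count data and is a hint that a count-based model is more appropriate. The function names, the example gene names and the threshold of 1.5 are my own illustrative assumptions, and with only a handful of replicates the ratio is noisy, so treat this as a rough diagnostic rather than a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio for one feature's counts across replicates.

    Needs at least two replicates. Returns NaN when the mean is zero,
    because the ratio is undefined there.
    """
    m = mean(counts)
    if m == 0:
        return float("nan")
    return variance(counts) / m  # sample variance (n - 1 denominator)


def overdispersed_features(table, threshold=1.5):
    """Return the names of features whose ratio exceeds the threshold.

    `table` maps a feature name to its list of replicate counts.
    The default threshold is an arbitrary illustrative value.
    """
    flagged = []
    for name, counts in table.items():
        # NaN compares False against any number, so all-zero features are skipped.
        if dispersion_index(counts) > threshold:
            flagged.append(name)
    return sorted(flagged)
```

If most of your features are flagged, that is a reason to check whether the tool you are using was designed for counts at all.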
Not keeping up to date with the literature
Bioinformatics is a fast-moving field and it is important to keep up to date with the latest developments. A good recent example is RNA-Seq: for a long time, the workflow has been “align to the genome, quantify against annotated genes”. However, there is increasing evidence that the alignment stage introduces a lot of noise and error, and there are new alignment-free tools that are both faster and more accurate. That’s not to say that you must always work with the most bleeding-edge software tools, but there is a happy medium where new tools are compared to existing tools and shown to be superior.
For example, one paper suggested that the use of SAMtools for indel discovery may come with a 60% false discovery rate. 60%! Wow. Of course, that was written in 2013, and a 2014 comparison using a more recent version of SAMtools shows better performance (though still an inflated false positive rate).
Bioinformatics is a research subject. It’s complicated.
All of this feeds back to the first habit above. Keeping up to date with the literature is essential; otherwise you will use inappropriate software and introduce errors.
Writing terrible documentation, or not writing any at all
Bioinformaticians don’t document things in the same way wet lab scientists do. Wet lab scientists keep lab books, which have many flaws; however, they are a real physical thing that people are used to dealing with. Lab books are reviewed and signed off regularly, and that process shows you when things have not been documented well.
How do bioinformaticians document things? Well, often using things like readme.txt files, and on web-based media such as wikis or GitHub. My experience is that many bioinformaticians, especially young and inexperienced ones, will keep either (i) terrible notes or (ii) no notes on what they have done. Both are equally bad. Keeping notes, version-controlled notes, on what happened in what order, what was done and by whom, is essential for reproducible research. It is essential for good science. If you don’t keep good notes, then you will forget what you did pretty quickly, and if you don’t know what you did, no-one else has a chance.
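As one possible way of keeping those notes, here is a minimal sketch of a plain-text analysis log that records who ran what, when, and with which tool versions. It writes one JSON line per step, so the file diffs cleanly and can sit in the same version-controlled repository as your scripts. The function names, the field names and the log layout are my own assumptions for illustration; use whatever structure suits your group, as long as you actually keep it.

```python
import json
from datetime import datetime, timezone


def make_record(who, command, tool_versions, note=""):
    """Build one log entry: who ran what, when, and with which tool versions."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "command": command,
        "tool_versions": dict(tool_versions),
        "note": note,
    }


def format_record(record):
    """Serialise a record as a single JSON line with a stable key order."""
    return json.dumps(record, sort_keys=True)


def append_record(path, record):
    """Append the record to a plain-text log that can live under version control."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_record(record) + "\n")
```

In use, you would call something like `append_record("analysis_log.jsonl", make_record("alice", "bwa mem ref.fa reads.fq > aln.sam", {"bwa": "0.7.17"}, note="first pass"))` after each step; the user, file names and version number here are made-up example values.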
Not writing tests
Tests are essential. They are the controls (positive and negative) of bioinformatics research, and they don’t just apply to software development. Bad bioinformaticians don’t write tests. As a really simple example, if you are converting a column of counts to a column of percentages, you may want to add up the percentages to make sure they sum to 100. Simple, but it will catch errors. You may want to find out the sex of all of the samples you are processing and make sure it matches the signal from the sex chromosomes. There are all sorts of internal tests that you can carry out when performing any analysis, and you must implement them; otherwise errors creep in and you won’t know about them.
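Both of the checks described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function names are my own, the sex check assumes you have already computed the fraction of each sample’s reads that map to chromosome Y, and the cut-off of 0.001 is a hypothetical value that you would need to calibrate on your own data.

```python
import math


def counts_to_percentages(counts):
    """Convert a list of non-negative counts to percentages of the total."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [100.0 * c / total for c in counts]


def check_percentages(percentages, tolerance=1e-6):
    """Internal control: every value lies in 0-100 and the column sums to 100."""
    if any(p < -tolerance or p > 100.0 + tolerance for p in percentages):
        raise ValueError("percentage outside the range 0-100")
    total = math.fsum(percentages)
    if not math.isclose(total, 100.0, abs_tol=tolerance):
        raise ValueError(f"percentages sum to {total}, expected 100")
    return True


def sex_mismatches(recorded_sex, chr_y_fraction, threshold=0.001):
    """Return sample IDs whose recorded sex disagrees with the chrY signal.

    `recorded_sex` maps sample ID to "male" or "female" from the sample sheet.
    `chr_y_fraction` maps sample ID to the fraction of reads on chromosome Y.
    The threshold is a hypothetical cut-off, not a recommended value.
    """
    mismatches = []
    for sample, sex in recorded_sex.items():
        inferred = "male" if chr_y_fraction[sample] > threshold else "female"
        if inferred != sex:
            mismatches.append(sample)
    return sorted(mismatches)
```

The point of raising an error rather than printing a warning is that a failed control stops the pipeline; a non-empty list from `sex_mismatches` is a prompt to go and look for a sample swap before doing anything else.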
Rewriting things that exist already
As Nick and I said, “Someone has already done this, find them!”. No matter what problem you are working on, 99.9999% of the time someone else has already encountered it and either (i) solved it, or (ii) is working on it. Don’t re-invent the wheel. Find the work that has already been done and either use it or extend it. Try not to reject it because it’s been written in the “wrong” programming language.
My experience is that most bioinformaticians, left to their own devices, will rewrite everything themselves in their preferred programming language and with their own quirks. This is not only a waste of time, it is dangerous, because they may introduce errors or may not consider problems other groups have encountered and solved. Working together in a community brings safety: it brings multiple minds to the table to consider and solve problems.
There we are. Those are my five habits of bad bioinformaticians. Are you doing any of these? STOP. Do you manage a bioinformatician? SEND THEM THIS BLOG POST, and see what they say ;-)