How to extract FASTQ from the new MinION FAST5 format using poRe

A little while ago I demonstrated how to extract FASTQ from MinION FAST5 files using Rscript and the Linux command line.

In that article, I described how to extract the different FASTQ data using a parameter:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))

The f5path parameter is the address of the FASTQ data within the FAST5 file, and that has recently been changed by Oxford Nanopore.  Fortunately, we can very easily edit our scripts to cope with the new format:

# 2D is the default
invisible(apply(array(f5files), 1, printfastq))
# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_1D_000/BaseCalled_complement"))

And that’s it! No new version of poRe required – simply a couple of edits to your scripts.  Example scripts can be found on github.
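
Since only the basecall group name differs between the old and new formats, the f5path string can also be built programmatically.  Below is a minimal sketch; f5path_for() is a hypothetical helper of my own, not part of poRe:

```r
# Hypothetical helper (not part of poRe): build the f5path argument for
# printfastq from a basecall group name and a read type, so that only one
# string needs changing if the group name changes again.
f5path_for <- function(group = "Basecall_1D_000",
                       read_type = "BaseCalled_template") {
  paste("/Analyses", group, read_type, sep = "/")
}

f5path_for()
# "/Analyses/Basecall_1D_000/BaseCalled_template"
f5path_for("Basecall_2D_000", "BaseCalled_complement")
# "/Analyses/Basecall_2D_000/BaseCalled_complement"
```

It would then be used in the usual way, e.g. invisible(apply(array(f5files), 1, printfastq, f5path=f5path_for())).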

The five habits of bad bioinformaticians

Whenever I see bad bioinformatics, a little bit of me dies inside, because I know there is ultimately no reason for it to have happened.  As a community, bioinformaticians are wonderfully open, collaborative and helpful.  I genuinely believe that most problems can be fixed by appealing to a local expert or the wider community.  As Nick and I pointed out in our article, help is out there in the form of SeqAnswers and Biostars.

I die even more when I find that some poor soul has been doing bad bioinformatics for a long time.  This most often happens with isolated bioinformatics staff, often young, and either on their own or completely embedded within a wet-lab research group.  I wrote a blog post about pet bioinformaticians in 2013 and much of the advice I gave there still stands today.

There are so many aspects to bioinformatics that many wet-lab PIs are simply incapable of managing them all.  This isn’t a criticism.  There are very few superheroes; few people who can competently span the gap between wet- and dry- science.  So I thought I’d write down the five habits of bad bioinformaticians.  If you are a pet bioinformatician, read these and figure out whether you’re doing them.  If you’re a wet-lab PI managing bioinformatics staff, try to find out if they have any of these habits!

Using inappropriate software

This can take many forms, and the most common is out-of-date software.  Perhaps they still use Maq, when BWA and other tools are now far more accurate; perhaps the software is a major version out of date (e.g. Bowtie instead of Bowtie2).  New releases come out for a reason, and the major reasons are (i) new/better functionality and (ii) fixes for bugs in old code.  If you run out-of-date software, you are probably introducing errors into your workflow, and you may be missing out on more accurate methods.

Another major cause of inappropriate software use is that people often use the software they can install rather than the software that is best for the job.  Sometimes there is a good reason for this, but most often there isn’t.  It is worth persevering – if the community says that a tool is the correct one to use, don’t give up on it just because it won’t install.

Finally, there are just simple mistakes – using a non-spliced aligner for RNA-Seq, using software that assumes the wrong statistical model (e.g. techniques that assume a normal distribution applied to counts data), etc.

These aren’t minor annoyances – they can result in serious mistakes and embed serious errors in your data, which can lead to bad science being published.  Systematic errors in analysis pipelines can often look like real results.

Not keeping up to date with the literature

Bioinformatics is a fast-moving field and it is important to keep up to date with the latest developments.  A good recent example is RNA-Seq – for a long time, the workflow has been “align to the genome, quantify against annotated genes”.  However, there is increasing evidence that the alignment stage introduces a lot of noise/error, and there are new alignment-free tools that are both faster and more accurate.  That’s not to say that you must always work with the most bleeding-edge software tools, but there is a happy medium where new tools are compared to existing tools and shown to be superior.

Look, for example, here: a paper suggesting that the use of SAMtools for indel discovery may come with a 60% false discovery rate.  60%!  Wow…  Of course, that was written in 2013, and in 2014 a comparison of a more recent version of SAMtools showed better performance (though still an inflated false-positive rate).

Bioinformatics is a research subject.  It’s complicated.

All of this feeds back to point (1) above.  Keeping up to date with the literature is essential, otherwise you will use inappropriate software and introduce errors.

Writing terrible documentation, or not writing any at all

Bioinformaticians don’t document things in the same way wet lab scientists do.  Wet lab scientists keep lab books which, for all their flaws, are a real, physical thing that people are used to dealing with.  Lab books are reviewed and signed off regularly, and that process makes it easy to tell when things have not been documented well.

How do bioinformaticians document things?  Well, often using things like readme.txt files, and on web-based media such as wikis or github.  My experience is that many bioinformaticians, especially young and inexperienced ones, will keep either (i) terrible notes or (ii) no notes at all on what they have done.  The two are equally bad.  Keeping notes – version-controlled notes – on what happened in what order, what was done and by whom, is essential for reproducible research.  It is essential for good science.  If you don’t keep good notes, you will forget what you did pretty quickly, and if you don’t know what you did, no-one else has a chance.

Not writing tests

Tests are essential.  They are the controls (positive and negative) of bioinformatics research, and they don’t just apply to software development.  Bad bioinformaticians don’t write tests.  As a really simple example, if you are converting a column of counts to a column of percentages, you may want to check that the percentages sum to 100.  Simple, but it will catch errors.  You may want to find out the sex of all of the samples you are processing and make sure it maps appropriately onto the signal from the sex chromosomes.  There are all sorts of internal tests you can carry out when performing any analysis, and you must implement them, otherwise errors creep in and you won’t know about them.
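
The counts-to-percentages check described above takes one line of R.  A minimal sketch (the counts are made up; the tolerance guards against floating-point rounding):

```r
# Convert a column of counts to percentages, then test the result:
# the percentages must sum to 100 (within floating-point tolerance)
counts <- c(12, 30, 58)
percentages <- 100 * counts / sum(counts)

stopifnot(abs(sum(percentages) - 100) < 1e-9)
```

If the conversion is ever broken – say, by dividing by the wrong total – stopifnot() halts the script immediately rather than letting the error propagate downstream.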

Rewriting things that exist already

As Nick and I said, “Someone has already done this, find them!”.  No matter what problem you are working on, 99.9999% of the time someone else has already encountered it, and has either (i) solved it or (ii) started working on it.  Don’t re-invent the wheel.  Find the work that has already been done and either use it or extend it.  Try not to reject it just because it’s been written in the “wrong” programming language.

My experience is that most bioinformaticians, left to their own devices, will rewrite everything themselves in their preferred programming language and with their own quirks.  This is not only a waste of time, it is dangerous, because they may introduce errors or may not consider problems other groups have encountered and solved.  Working together in a community brings safety, it brings multiple minds to the table to consider and solve problems.


There we are.  That’s my five habits of bad bioinformaticians.  Are you doing any of these?  STOP.  Do you manage a bioinformatician?  SEND THEM THIS BLOG POST, and see what they say ;-)

We definitely do not need a “Steve Jobs” in biology

If you follow me on Twitter, you cannot help but know that I am not Steve Jobs’ biggest fan – in fact I have frequently called him the most over-rated man in history.  I stand by that.  Steve Jobs was just a common-or-garden billionaire capitalist, no better or worse than any other billionaire capitalist.  As such he deserves some respect, and I give him that – well done, Steve, for taking things and selling enough of them to make yourself a billionaire.  It’s a feat that deserves respect – I couldn’t do it.  Why then is he idolized?  I don’t have all of the answers, but I suspect it has something to do with the fact that he ran a company that designed cool-looking consumer devices that, for a minority, defined them and defined their entire lives.

So why am I writing about Steve Jobs?

Because Eric Schadt said the biggest need in biology today is a “Steve Jobs”.

I was pretty astounded when I read that, because Eric is an intelligent man.  So why the ridiculous statement?  Let’s look at this a bit more.

What did Steve Jobs actually do?

Steve Wozniak, who designed the technology behind early Apple devices, has said that Steve Jobs wasn’t interested in technology, he just wanted to be important.  Sour grapes?  Perhaps; though perhaps Wozniak was the real creative genius, and resents the fact that Jobs got all of the credit.

Others have analysed what Steve Jobs actually did (many ask the question), and I quote from Ian McCullough below:

He was central to the mass marketing and consumerization of the personal computer — but he by no means invented it. That credit most directly goes to people like Ivan Sutherland, Alan Kay, Douglas Engelbart, and the people who worked at Xerox PARC and the Stanford Research Institute during the 1960s and 1970s. Let us also remember that he didn’t invent the portable digital music player. Looking past the Sony Walkman which first met the general desire to carry music with you, the first known mention of a digital music player is Kane Kramer’s IXI in 1981. In the US, the Diamond Rio was the first widely available portable MP3 player in the late ’90s. The iPod didn’t come out until 2001. How about multi-touch screens? It’s a big “no” there too. That was Bent Stumpe & Frank Beck and later Bell Labs at Murray Hill, with some early experimentation from musicians like Hugh Le Caine and Bob Moog. And what of Pixar? He actually bought what became known as Pixar from none other than George Lucas in 1986. It was Ed Catmull that drove the Pixar technology side and ultimately John Lasseter that drove the creative side. (Steve does get credit there for cultivating the business, building The Walt Disney Company relationship, and – according to all third-party reports – smartly staying out of the way.) To be completely fair: as far as I know he never actually laid claim to these accomplishments, but he is so tightly associated with them in the popular mind that many people make the error.

Steve Jobs was a salesman of unparalleled natural talent, an astute curator, and a very tough editor. It is clear that he made the ideas and products that he came into contact with significantly better and extremely desirable to consumers, but he did not originate those things.

So, from what I can gather, Steve Jobs took what other people invented, made it cool and desirable and sold it for great profit.  Remind me: why do we need that in biology?

What did Eric Schadt actually say?

Eric’s vision – and it is one that I and many others share – is that if you collect enough relevant information, not only scientific data but also patient data, and you include “additional” data feeds such as the microbiome and the local environment, then we will be able to build predictive models that improve the treatment of disease in humans.  The models can and will be refined by experimental validation, which will feed back into and improve the models.

That’s cool, and as I said, many people share this vision.  In fact, in many ways this describes molecular biology research of the last 10-15 years.

Eric goes on to say that we need to put these models in the hands of doctors who can work with them – who don’t need to understand the models, but who do need to be able to use them.  And that’s where he brings Steve Jobs into the equation; here is what he said:

“I think the number one need in biology today is a Steve Jobs. Where is the Steve Jobs of biology who can lead the design of amazing, intuitive interfaces into these complex data?”

Wait, wut?  Firstly, what amazing intuitive interface is he talking about?  Angry Birds?  The iPod wheel?  Touch screens (which neither Steve Jobs nor Apple invented)?  A high-definition screen?  Aluminium?  Icons?  Is he talking about iOS icons?  Does Eric think that what biology needs is a set of cartoonish icons designed for toddlers?


Back to what Eric said: do we need these models?  Absolutely, of course we do!  Do we need doctors to be able to use them?  Damn right!  Do we need an amazing intuitive interface to them?  Absolutely.  Sign me up!  Did Steve Jobs ever invent such an interface?  Of course he fucking didn’t, he’s a fucking salesman!

Do we need an amazing salesman who steals your personal data in biology/medicine?  FUCK NO!

What else did Eric Schadt say?

He said a bunch of stuff that I agree with.  He’s hiring intelligent people from machine learning, high-performance computing, network modelling etc, and I think that’s a great idea (though hiring some dude from Facebook also raised my hackles slightly).  He talks about openness and data sharing, and the problems he faces with a sense of data ownership.

I mean, all in all I agree with Eric – apart from that dumb reference to Steve Jobs!

An inconvenient truth

There are several inconvenient truths in biology today that might prevent Eric’s (and others’) vision from coming true.  Here are the main two:

  1. We cannot accurately measure everything we need to measure.  We don’t survey the whole of the genome.  We are not able to measure or quantify proteins and other small molecules accurately.  Our knowledge of epigenetics, 3D genome structure, the action of enhancers, ncRNA, the microbiome etc is not refined enough.  Put simply, we don’t have all of the data that the model needs, nor can we easily get it.
  2. Once we do have all of the data, the most intelligent machine on the planet – the human brain – will be incapable of understanding it all.  It’s high-dimensional data that we as humans simply can’t comprehend.  Jun Wang (or Wang Jun as I know him) is right when he says that artificial intelligence is the future of genomics – the only thing that can genuinely understand the complexity of a living cell, or a living organism, is an intelligence far greater than our own.  The problem is that AI is nowhere near this capability yet.

I’m not saying that machine learning, models and “big data” won’t advance knowledge and insight; they may even provide us with additional treatments.  But thousands of diseases will remain uncured until such time as “we” can understand biology, and we’re quite a long way from that at the moment.  Credit to Eric for trying.

A last word on Steve Jobs

Imagine a world with two billionaires, both of whom made their money by revolutionizing the world of computing and electronic devices.  One of them sets up the largest transparent private foundation in the world, which aims to enhance healthcare and reduce extreme poverty globally.  The other one dies.

Which one do we need in biology?

What I know about the expansion of HiSeq X to non-human genomes

Update 15:31:

Well, it isn’t very much…..

Illumina have announced that they will now allow researchers to sequence non-human genomes on the HiSeq X and, well, we have HiSeq X systems, so we have a tiny bit of knowledge gleaned from our communications with them.

What they say is:

The updated rights of use will allow for market expansion and population-scale sequencing of non-human species in a variety of markets, including plants and livestock in agricultural research and model organisms in pharmaceutical research. Previously, it has been cost prohibitive to sequence non-human genomes at high coverage.

This is a bit vague, but there are a few keywords in there – “genomes” and “high coverage”.

My opinion is that this was done after pressure from the mouse and rat model-organism communities, who want to do thousands of 3Gb 30X genomes, just not human genomes.  We (as in Roslin) have also been putting on pressure to enable us to do farm animal genomes.

It’s only genomes: so not RNA-Seq, ChIP-Seq, exomes, amplicons etc – none of that.  Just genomes.

It’s only high coverage: 30X or above.  So no 10X, 5X or 1X genomes.

We asked about metagenomics (WGS): maybe (Update 08/10/2015: metagenomics is not supported)

We asked about smaller genomes (100Mb): (Update 12/10/2015) small genomes are allowed; the issue is that only 12 indexes are available per lane, with each lane producing around 120Gbase of data.  That’s 10Gbase per index (or 2000-fold coverage of a 5Mb bacterial genome)

We asked about medium sized genomes (1Gb): maybe (how a 30X 1Gb genome is different to a 10X 3Gb genome is anyone’s guess, but apparently, maybe, it is)

Bisulfite sequencing: my understanding is that this is supported/allowed (Update 08/10/2015)
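
The per-index arithmetic above, written out (all figures as quoted in this post):

```r
# Per-index yield and coverage for a HiSeq X lane, using the figures above
lane_yield_bp    <- 120e9   # each lane produces around 120 Gbase of data
indexes_per_lane <- 12      # only 12 indexes available per lane
genome_size_bp   <- 5e6     # a 5 Mb bacterial genome

per_index_bp <- lane_yield_bp / indexes_per_lane   # 10 Gbase per index
coverage     <- per_index_bp / genome_size_bp      # 2000-fold coverage
```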

Having said all that – my opinion is that what Illumina want to stimulate is massive, large-scale genomics projects.  Think on a scale of 1000s of 30X human genomes (90 terabases).  If you have fly, yeast, C. elegans, metagenomics or small-genome projects on that scale, I should think Illumina would be very happy to talk to you about getting them sequenced on an X somewhere.  However, if you’re smaller scale than that, well – we have HiSeq 4000 systems at Edinburgh Genomics, which offer incredible value for money.


What does the PacBio Sequel mean for the future of sequencing?

PacBio have responded to intense competition in the sequencing market by releasing the Sequel, a machine that promises seven times the throughput of their last machine (the RSII) at a price of $350k (the RSII cost in the region of $750k).  So they have a cheaper, higher-throughput machine (though I haven’t seen cost-per-Gb figures) – around 7 Gigabases of reads averaging around 15Kb in length per SMRT cell.  Without doubt this is a very interesting platform, and they will sell as many of them as they can produce.  A sweet spot for assembly is still 50-60X for PacBio, so think 12 SMRT cells to get a human genome: let’s say £12k per genome.  Edit 01/10/2015: my maths at 7am was not that good!  A 50X human genome is 150Gb, so that’s 21 SMRT cells and £21K per human genome.  Much more expensive than Illumina’s $1000 genome, but far, far better.
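
For what it’s worth, the corrected back-of-the-envelope maths from the paragraph above works out as follows (the ~7Gb-per-cell throughput and the implied ~£1k-per-cell price are this post’s figures, not official PacBio numbers):

```r
# SMRT cells needed for a 50X human genome on the Sequel,
# using the figures from the paragraph above
genome_gb     <- 3      # human genome, ~3 Gbase
coverage      <- 50     # sweet spot for PacBio assembly is 50-60X
cell_yield_gb <- 7      # ~7 Gbase per Sequel SMRT cell
cost_per_cell <- 1000   # ~£1k per SMRT cell, as implied above

data_needed_gb <- genome_gb * coverage             # 150 Gbase
smrt_cells     <- data_needed_gb / cell_yield_gb   # ~21.4 cells
approx_cost    <- round(smrt_cells) * cost_per_cell
```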

I just want to say that at times I have been accused of being both an Illumina fanboy and a nanopore fanboy; I am neither and both.  I am just a fan of cool technology, from microarrays in the 90s to sequencing now.

In long reads, let’s be clear, we are talking about the promise of Oxford Nanopore vs the proven technology of PacBio, and the Sequel changes the dynamics.  However, the MinION fast mode is capable of throughput in the region of 7Gb (like the Sequel), and the PromethION is capable of throughput on Illumina scale.  Therefore, Oxford Nanopore are far from dead – though they do need to respond.

So how does the new PacBio Sequel change the market?  A lot of the initial reactions I have seen are that the Sequel is a real threat to Oxford Nanopore.  It certainly ramps up the competition in the long-read space, which is a really good thing.  But actually, high-throughput long-read machines like the Sequel and the PromethION don’t spell the end for one another – they spell the beginning of the end for Illumina as a sequencing platform.

As soon as you have high-throughput, cheap long reads, it is in fact Illumina who face a problem.  I love Illumina.  When I first arrived at Roslin, I walked into our lab and (honestly!) stroked our Illumina GAIIx.  Illumina have revolutionised biology.  However, short reads have limitations – they are bad for genome assembly, they are bad at complex genomes, they’re actually quite bad at RNA-Seq, they are pretty bad for structural variation, they are bad at haplotypes and SNP phasing, and they are not that great at metagenomics.  What has made Illumina the platform of choice for those applications is scale – but as soon as long read technologies reach a similar scale, Illumina looks like a poor choice.

The Sequel (and the PromethION) actually challenge Illumina – because in an era of cheap, long read sequencing, Illumina becomes a genotyping platform, not a sequencing platform.

Extracting MinION fastq on the command line using poRe

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in Bioinformatics this year:

Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015) 
poRe: an R package for the visualization and analysis of nanopore sequencing data. 
Bioinformatics 31(1):114-5.

poRe is based on R; many people think of R as an interactive package, but it is very easy to write command-line scripts in R, in much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
        stop("Please provide a directory containing fast5 files as the first (and only) argument")
}

# the first argument is the directory; stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
        stop(paste(my.dir, "does not exist"))
}

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe, suppressing its start-up messages
suppressMessages(library(poRe))

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))

The suppressMessages() function means that none of poRe’s start-up messages are printed when the library is loaded, and the invisible() function suppresses the natural output of apply() (which would otherwise return an array of undefineds the same length as f5files – which we definitely do not want).

The above is based on R 3.1.2 and poRe 0.9.

We could extract template and complement fastq (respectively) by substituting in the following lines:

# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))


Finally, a definition for bioinformatics

Following the trend for ultra-short-but-to-the-point blog posts, I have decided to finally define bioinformatics:

bio-informatics (bi-oh-in-foh-shit-I-don’t-understand-how-that-works)

From the word bio, meaning “of or related to biology” and informatics, meaning “absolutely anything your collaborators or boss don’t understand about maths, statistics or computing, including why they can’t print and how the internet works”

Happy to finally put that one to bed!