What does the PacBio Sequel mean for the future of sequencing?

PacBio have responded to intense competition in the sequencing market by releasing the Sequel, a machine that promises 7 times the throughput of their last machine (the RSII) at a price of $350k (the RSII cost more, in the region of $750k). So they have a cheaper, higher-throughput machine (though I haven’t seen cost-per-Gb figures).  That’s around 7 gigabases of reads averaging around 15kb in length.  Without doubt this is a very interesting platform and they will sell as many of them as they can produce.  A sweet spot for assembly is still 50-60X for PacBio, and a 50X human genome is 150Gb, so think 21 SMRT cells per human genome: let’s say £21k per genome.  Much more expensive than Illumina’s $1000 genome, but far, far better.
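
As a sanity check, here is that back-of-the-envelope sum in R; the ~£1,000-per-SMRT-cell cost is my assumption, implied by the £21k total:

# back-of-envelope: SMRT cells and cost for a 50X human genome
genome.gb    <- 3      # human genome, ~3 Gb
coverage     <- 50     # target depth for assembly
gb.per.cell  <- 7      # quoted Sequel throughput per SMRT cell
gbp.per.cell <- 1000   # assumed cost of ~£1k per SMRT cell

cells <- round(genome.gb * coverage / gb.per.cell)   # 150 / 7, roughly 21 cells
cost  <- cells * gbp.per.cell                        # roughly £21,000 per genome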

I just want to say that at times I have been accused of being both an Illumina fanboy and a nanopore fanboy; I am neither and both.  I am just a fan of cool technology, from microarrays in the 90s to sequencing now.

In long reads, let’s be clear, we are talking about the promise of Oxford Nanopore vs the proven technology of PacBio.  And the Sequel changes the dynamics.  However, the MinION fast mode is capable of throughput in the region of 7Gb (like the Sequel) and the PromethION is capable of throughput on Illumina-scale.   Therefore, Oxford Nanopore are far from dead – though they need to respond.

So how does the new PacBio Sequel change the market?  A lot of initial reactions I have had are that the Sequel is a real threat to Oxford Nanopore.  It certainly ramps up the competition in the long read space, which is a really good thing.  But actually, high-throughput long read machines like the Sequel and the PromethION don’t spell the end for one another – they actually spell the beginning of the end for Illumina – as a sequencing platform.

As soon as you have high-throughput, cheap long reads, it is in fact Illumina who face a problem.  I love Illumina.  When I first arrived at Roslin, I walked into our lab and (honestly!) stroked our Illumina GAIIx.  Illumina have revolutionised biology.  However, short reads have limitations – they are bad for genome assembly, they are bad at complex genomes, they’re actually quite bad at RNA-Seq, they are pretty bad for structural variation, they are bad at haplotypes and SNP phasing, and they are not that great at metagenomics.  What has made Illumina the platform of choice for those applications is scale – but as soon as long read technologies reach a similar scale, Illumina looks like a poor choice.

The Sequel (and the PromethION) actually challenge Illumina – because in an era of cheap, long read sequencing, Illumina becomes a genotyping platform, not a sequencing platform.

Extracting MinION fastq on the command line using poRe

Hopefully most of you will be aware of our software for handling MinION data, poRe, published in Bioinformatics this year:

Watson M, Thomson M, Risse J, Talbot R, Santoyo-Lopez J, Gharbi K, Blaxter M. (2015) 
poRe: an R package for the visualization and analysis of nanopore sequencing data. 
Bioinformatics 31(1):114-5.

poRe is based on R; many people think of R as an interactive environment, but it is very easy to write command-line scripts for R in much the same way as you write Perl or Python scripts.  Below is a script that can be used in conjunction with R and poRe to extract 2D fastq data from a directory full of fast5 files:

#!/usr/bin/env Rscript

# get command line arguments as an array
args <- commandArgs(trailingOnly = TRUE)

# if we don't have any command line arguments
# print out usage
if (length(args) == 0) {
        stop("Please provide a directory containing fast5 files as the first (and only) argument")
}

# the first argument is the directory; stop if it doesn't exist
my.dir <- args[1]
if (! file.exists(my.dir)) {
        stop(paste(my.dir, "does not exist"))
}

# get a list of all of the files with a .fast5 file name extension
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# print to STDERR how many files we are going to attempt
write(paste("Extracting 2D fastq from", length(f5files), "files"), stderr())

# load poRe, suppressing its start-up messages
suppressMessages(library(poRe))

# apply printfastq to every entry in f5files
# printfastq tests for 2D fastq and if found prints to STDOUT
invisible(apply(array(f5files), 1, printfastq))

The suppressMessages() function means that none of the start-up output is printed when the poRe library is loaded, and the invisible() function suppresses the natural output of apply(), which would otherwise return (and print) a vector of NULLs the same length as f5files, which we definitely do not want.

The above is based on R 3.1.2 and poRe 0.9.
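
To run it from the shell, something like the following works; extract2D.R is just what I happen to call the script, and the fast5 directory is a placeholder:

chmod +x extract2D.R
./extract2D.R /path/to/fast5 > reads.2D.fastq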

We could extract template and complement fastq (respectively) by substituting in the following lines:

# template
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_template"))
# complement
invisible(apply(array(f5files), 1, printfastq, f5path="/Analyses/Basecall_2D_000/BaseCalled_complement"))


Finally, a definition for bioinformatics

Following the trend for ultra-short-but-to-the-point blog posts, I have decided to finally define bioinformatics:

bio-informatics (bi-oh-in-foh-shit-I-don’t-understand-how-that-works)

From the word bio, meaning “of or related to biology” and informatics, meaning “absolutely anything your collaborators or boss don’t understand about maths, statistics or computing, including why they can’t print and how the internet works”

Happy to finally put that one to bed!

Why nanopore sequencing errors aren’t actually errors

Let me start by saying that what I’ve written below are opinions formed after many conversations with many different people, including some from ONT, over the last 2-3 years.  I don’t want anyone to think I am stealing their ideas – but I think it’s important to get them out there.

Oxford Nanopore’s technology, currently available in the MinION device, is the only sequencing technology to directly sequence an actual strand of DNA.  Every other technology sequences incorporation events into a template strand (including Sanger, Illumina, Ion and PacBio), and invariably those incorporation events involve the canonical bases, A, G, C and T.  However, in nature, DNA is much more complicated than that: there are base modifications and base analogues.  The MinION (and other ONT sequencers) is the only sequencer that directly detects those modifications and analogues.  However, what we do when we “base call” the raw ONT data is to compress it from natural-DNA-space into canonical-AGCT-space, and whenever we compress we lose signal; that lost signal turns up in the system as noise, as so-called “error”.  People talk about the error rate of the MinION being around 10%, but not all of it is error; some of it is signal, signal for something we don’t quite understand yet, so we’re interpreting it (wrongly) as AGCT.

That’s one of the exciting things about nanopore sequencing that no-one is talking about.  Oxford Nanopore’s sequencers could possibly reveal a whole new alphabet for DNA that we didn’t know about before – and my bet is that it’ll explain some of what we currently struggle to understand in genetics and genomics.

Why I’m not jumping on the ORCID bandwagon

Yesterday I had a bit of a lively discussion with many people on Twitter about ORCID, a “new” “standard” for enabling researchers to link themselves, their grants and their impacts in a single place.  If you want to follow the discussion, see the replies to my original tweet.

I stand by my comment that this is XKCD #927 all over again.  For me, ORCID is just another place where I need to keep my profile up to date, and it’s not even a very good one.  It’s not the standard, it is just one of many competing standards for presenting my academic life to the world: in addition to those discussed below (PURE, Google Scholar, ResearchFish, EndNote), there are Scopus, My NCBI, LinkedIn, ResearchGate, Mendeley, Microsoft, etc etc etc.  The list is almost endless.  And yes, I know you think ORCID will put an end to that list, but all of the other systems think the same thing, and that’s the point of that XKCD comic!

It used to be even worse: there used to be a bunch of other websites where I also needed to keep my details live and up-to-date.  However, through good systems and hard work, I fixed that.  I have all of my data live and up-to-date in a brilliant system, and I absolutely love it… and it’s not ORCID.

Let me tell you about PURE

OK, firstly: PURE is brilliant.  Those of you who know me will know I am not being sarcastic; you know how infrequently I compliment anything.  But I am happy to compliment PURE.  It is amazing, it does everything I need it to, and it does it really well.

I have used both PURE and ORCID and let me tell you: PURE makes ORCID look like a 5yr old’s summer project.  They are not even comparable, so I don’t know why I am trying.  Whilst PURE is a Ferrari, ORCID is a push-bike with only one wheel and a broken chain.

PURE is a research information management system sold by Elsevier, and it is the system chosen by my employer, the University of Edinburgh.  Because I work for them, the University is perfectly within their rights to expect me to keep my PURE record up to date, and I do so with pleasure.

PURE models the University perfectly: Schools, organisational units within Schools, projects and research outputs.  Crucially, these all have many-to-many relationships, which means that I can assign any research output to any number of projects, organisational units and/or Schools.  Also, projects and funding are kept separate (but related), recognising that it is possible to have projects that do not relate to a particular grant, or which relate to many grants.  I can add press clippings, generic activities, datasets, awards, public engagement activities, book chapters, patents, conference proceedings, my thesis, and all sorts of other things, and I can link and relate them to my projects, my funding and my School: seamlessly, online and easily.

Mostly the research outputs I record are papers, and here PURE excels: I can import from a number of online sources, and it works EVERY TIME and WITHOUT ERROR.  Amongst others, this means I can import from PubMed, Scopus, Web of Science, CrossRef and ArXiv.  I can import anything with a DOI.  I can import anything with a PubMed ID.

When I import those records, PURE parses them seamlessly into authors, title, journal, abstract etc etc, and all I need do is link the record to the relevant project, funding and school and I am done.  It works amazingly well and it takes no time at all.

When it comes to reporting time, PURE has the concept of report definitions, and I can slice and dice my data any way I see fit: by school, project, funding, date etc, all through an amazingly intuitive interface.  I can then download the results in a number of formats; or I can simply choose to export all of my (or my project’s, or my school’s, or my funding’s…) publications in RIS (RefMan) or BibTeX format.

Finally, the whole system has an API, which means it drives the Edinburgh Genomics publication list, it drives my Roslin publication list, and it drives Edinburgh Research Explorer.

It rocks.  PURE is really, really good and if your institution doesn’t have it, tell them to buy it, immediately.

The nightmare that is ResearchFish

I hate ResearchFish.  Those of you who have been following my Twitter feed closely will remember!

However, ResearchFish is the chosen platform through which the vast majority of my funders want me to report the outcomes of the grants they give me.  Yep, the people who’ve given me millions of pounds of research and capital grants over the last few years want me to report my outputs in ResearchFish.  So of course I am going to.

The quick amongst you will point out that ResearchFish can import from ORCID; well, it’s stalemate I’m afraid, because we (as in Edinburgh) export our PURE records directly to ResearchFish, and this is carried out by others, meaning there is zero effort from me.  It just happens.  Of course, I need to edit the information once it is in there to make sure it is correct, but doubtless I would need to do the same if I went via ORCID.

Google Scholar

Like many academics, I have a Google Scholar profile, and whilst I dislike that it is not an open platform, I do like that it (generously) tracks citations, and I like that it seamlessly, almost magically, almost spookily, finds every single one of my publications a few days after it comes out, and asks politely if I would like to add it to my profile.  It is almost never wrong, but it sometimes is, so Google Scholar demands a bit of attention to keep it up to date.

Personal Records

Because I have always done it, I maintain an EndNote library with all of my papers (with PDFs) on network attached storage where I work.  It’s pretty easy to maintain, again with simple imports from PubMed and other locations.


So where does this leave ORCID?  In the wilderness, I’m afraid.  At the moment, in my head it’s in the same place as ResearchGate, LinkedIn and other places: just another site where I need to remember my login details, and try in vain to keep my records up to date.  I’d rather not.  I have a public profile of all of my grants, projects and research outputs in Edinburgh Research Explorer, driven by PURE.

Sure, if ORCID becomes a place my employer recognises and starts to use, or which my funders begin to use for reporting – then ORCID becomes a solution rather than a problem.  However, it is not there yet, and just because you really want it to be the standard for sharing academic profiles, it doesn’t mean that it is.

Good luck to ORCID – it would be truly immense if all institutions around the world, including employers and funders, adopted a single system to store academic relationships.  However, the chances of that happening are virtually zero, because institutions and funders want their data just so and may want to store private data alongside the public stuff.

There need to be significant improvements in ORCID’s features too: better import systems, easier interfaces for keeping data up to date, and an API so that we can drive websites off the data therein.  If anyone from ORCID (or ResearchFish) wants my advice, I’d tell them to take a look at PURE; it is near perfect.  Of course, with time and effort, perhaps ORCID could become as good as PURE, but then, why not just use PURE?



Assembling B. fragilis from MinION and Illumina data

You may have seen our bioRxiv preprint about the sequencing and assembly of B. fragilis using Illumina + MinION sequence data.  Well, here is how to do it yourself.

First get the data:

# MinION data (raw fast5 data; needs to be extracted)
wget ftp://ftp.sra.ebi.ac.uk/vol1/ERA463/ERA463589/oxfordnanopore_native/FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz
mkdir fragilis_minion
tar -xzf FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz -C fragilis_minion
rm fragilis_minion/*.md5

# MiSeq data (fastq data)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR973/ERR973713/ERR973713_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR973/ERR973713/ERR973713_2.fastq.gz

Then, within R:
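
A minimal sketch, assuming poRe provides printfasta() (the FASTA analogue of the printfastq() function used earlier in this post); capture.output() redirects its STDOUT into one FASTA file per read:

suppressMessages(library(poRe))

# where the fast5 files live, and where the FASTA should go
my.dir  <- "fragilis_minion"
out.dir <- "fragilis_minion/extracted"
dir.create(out.dir, showWarnings = FALSE)

# every file with a .fast5 extension
f5files <- dir(path = my.dir, pattern = "\\.fast5$", full.names = TRUE)

# write one 2D FASTA file per read
for (f5 in f5files) {
    out <- file.path(out.dir, paste(basename(f5), "2D.fasta", sep = "."))
    writeLines(capture.output(printfasta(f5)), out)
}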


This will extract all sequences as FASTA into fragilis_minion/extracted.  Let’s put all the 2D reads into one file:

cat fragilis_minion/extracted/*.2D.fasta > fragilis_minion.2D.fasta

And finally we are ready to assemble it:

spades.py -o spades_fragilis -1 ERR973713_1.fastq.gz -2 ERR973713_2.fastq.gz --nanopore fragilis_minion.2D.fasta

That’s as far as I can go on my Ubuntu laptop; I will update when I get to work!

4 predictions about nanopore sequencing in the next 12 months

I have been lucky enough to be involved in Oxford Nanopore’s MinION access programme since it kicked off, and we have an active grant to develop poRe, software we are writing to help scientists access and use MinION sequence data.

Because of the above, I have been lucky enough to work with some amazing people: incredibly driven, intelligent scientists.  Here are my predictions about what ONT, and those intelligent, driven scientists, are going to achieve in the next 12 months (probably sooner):

1. We will see the first full human genome sequenced using only Oxford Nanopore data.  The cost will be comparable to current techniques.

2. Genotyping and consensus accuracy will be very high, more than capable of accurately calling SNVs (arguably we are there already), and better than other technologies at calling structural variation.

3. Nanopore will become the default platform for calling base modifications (5mC, 5hmC etc).

4. All of the above will be possible without seeing a single A, G, C, or T (i.e. it will all be possible without base-calling the data).

Exciting times!