Assembling B fragilis from MinION and Illumina data

You may have seen our bioRxiv preprint about the sequencing and assembly of B fragilis using Illumina + MinION sequence data.  Well, here is how to do it yourself.

First get the data:

# MinION data (raw dast5 data; needs to be extracted)
mkdir fragilis_minion
tar -xzf FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D.tar.gz -C fragilis_minion
rm fragilis_minion/*.md5

# MiSeq data (fastq data)

Then, within R:


This will extract all sequences as FASTA into fragilis_minion/extracted.  Let’s put all the 2D reads into one file:

cat fragilis_minion/extracted/*.2D.fasta > fragilis_minion.2D.fasta

And finally we are ready to assemble it: -o spades_fragilis -1 ERR973713_1.fastq.gz -2 ERR973713_2.fastq.gz --nanopore fragilis_minion.2D.fasta

That’s as far as I can go on my Ubuntu laptop, wiil update when I get to work!

4 predictions about nanopore sequencing in the next 12 months

I have been lucky enough to be involved in Oxford Nanopore’s MinION access programme since it kicked off, and we have an active grant to develop poRe, software we are writing to help scientists access and use MinION sequence data.

Because of the above, I have been lucky enough to work with some amazing people: incredibly driven, intelligent scientists.  Here is my prediction about what ONT, and those intelligent, driven scientists, are going to achieve in the next 12 months (probably sooner):

1. We will see the first full human genome sequenced using only Oxford Nanopore data.  The cost will be comparable to current techniques.

2. Genotyping and consensus accuracy will be very high, more than capable of accurately calling SNVs  (arguably we are there already), and better than other technologies at calling structural variation

3. Nanopore will become the default platform for calling base modifications (5mC, 5hmC etc)

4. All of the above will be possible without seeing a single A, G, C, or T (i.e. it will all be possible without base-calling the data)

Exciting times!

Why FPKM makes sense

Simulating the perfect scenario

Let’s get our hands dirty in R.  I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:

gmat <- data.frame(id=paste("G",1:10,sep=""),
 length=as.integer(runif(10, min=200, max=10001)), 
 nmol=as.integer(runif(10, min=0, max=50001)),

This is what my data look like

> gmat
    id length  nmol
1   G1   5432  3206
2   G2   7177 43621
3   G3   6189 48670
4   G4   8529   635
5   G5   2576 27919
6   G6   8462 33280
7   G7   4316 18735
8   G8   3600 41457
9   G9   4973 21935
10 G10   9742 26207

What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene.  A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.

gmat$sspace <- gmat$length * gmat$nmol
ssum <- sum(gmat$sspace)
gmat$p <- gmat$sspace/ssum

My data now look like this:

> gmat
    id length  nmol    sspace          p
1   G1   5432  3206  17414992 0.01098634
2   G2   7177 43621 313067917 0.19750063
3   G3   6189 48670 301218630 0.19002544
4   G4   8529   635   5415915 0.00341666
5   G5   2576 27919  71919344 0.04537072
6   G6   8462 33280 281615360 0.17765861
7   G7   4316 18735  80860260 0.05101114
8   G8   3600 41457 149245200 0.09415216
9   G9   4973 21935 109082755 0.06881546
10 G10   9742 26207 255308594 0.16106284

My p-values should sum to one, and they do:

> sum(gmat$p)
[1] 1

Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix.  Therefore, 30million multiplied by my p-values will give me my RNA-Seq counts:

gmat$nreads <- round(30000000 * gmat$p)

And I can now calculate my FPKM values:

gmat$fpkm <- (gmat$nreads / (gmat$length / 1000)) / 30

My data now look like this:

> gmat
    id length  nmol    sspace          p  nreads       fpkm
1   G1   5432  3206  17414992 0.01098634  329590  2022.5209
2   G2   7177 43621 313067917 0.19750063 5925019 27518.5509
3   G3   6189 48670 301218630 0.19002544 5700763 30703.7388
4   G4   8529   635   5415915 0.00341666  102500   400.5941
5   G5   2576 27919  71919344 0.04537072 1361121 17612.8500
6   G6   8462 33280 281615360 0.17765861 5329758 20994.8719
7   G7   4316 18735  80860260 0.05101114 1530334 11819.0767
8   G8   3600 41457 149245200 0.09415216 2824565 26153.3805
9   G9   4973 21935 109082755 0.06881546 2064464 13837.8180
10 G10   9742 26207 255308594 0.16106284 4831885 16532.8309

Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:


We can calculate the relationship between these (forcing the model through the origin):

> lm(nmol ~ 0 + fpkm, data = gmat)

lm(formula = nmol ~ 0 + fpkm, data = gmat)


So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:

p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")


(please do not use this formula more generically – it really only works in this simple simulation!)

So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.

(unless I have made any huge errors or incorrect assumptions, which is possible)

101 bioinformatics facts

I am trying to crowd-source 101 fun bioinformatics facts!  Please contribute in the comments and I will add them below:

  1. BLAST is so fast, the authors had to deliberately slow down the code so it doesn’t overheat the servers
  2. GCG, the old bioinformatics package, was named after the authors kept high-fiving each other, shouting “good code guys!”
  3. Bowtie is named so because “it is almost impossible to tie”, referring to code to avoid a “race condition” when using multiple processors
  4. TopHat is named do because it was the first spliced RNA-Seq aligner, and when it worked first time, the authors shouted “Top that!”
  5. Over 1 billion people have searched the NCBI protein database for their own name
  6. The EBI is an elaborate front-end to NCBI services
  7. The SRA (short read archive) is the best known of the archives, and not many people know or use the MRA (medium read archive), the KLRA (kinda long read archive) and the LRA (long read archive)
  8. Europe PubMed Central has only ever been accessed by people accidentally clicking on links.  100% of visitors immediately bounce to
  9. There are now more journals than papers
  10. The HGAP assembler is actually an elaborate front-end hiding three thousand slave labourers all running GAP4 (via @IanGoodhead)
  11. HPC actually means “Homunculus Powered Computing”, and all servers are actually just mechanical turks full of leprechauns (via @froggleston)
  12. The group of LUMC is doing psipred protein folding on 1kWh household radiators (128 cores each) (via Eric Feliksik)
  13. The ‘p’ in p-value actually stands for p-otentially interesting! (via )
  14. Velvet is so named because @dzerbino wore velvet gloves when coding it (via @pathogenomenick)
  15. The @PacBio machines are so large because inside’s an Illumina machine + a bioinformatician running assemblies (via )
  16. The Cloud is actually just a cloud. That’s it. A real cloud (via @froggleston)
  17. If you plug in a @nanopore MinION and hit left,right,up,up,A,B,down, it’ll transform into a lifesize statue of @Clive_G_Brown (via @froggleston)
  18. DDBJ have their data centre in a volcano, and are basically a front for Osato Chemicals (SPECTRE) (via @SCEdmunds)
  19. Hidden Markov Models were initially developed to find Waldo (via )
  20. 99.5% of people who cite Altschul et al have never read the paper
  21. Bioinformatics Applications Notes have to be automatically generated by the software they describe
  22. BGI exclusively publish in Nature journals because their papers are first rejected by Gigascience
  23. BGI actually only have one HiSeq but made to look like hundreds by a set up of mirrors, like that bit in Enter the Dragon (via @froggleston)
  24. the consumption rate of coffee (+ beer 🍻) among Bioinformaticians from around the world is increasing every year. TRUE FACT! (via @NazeefaFatima)
  25. EBI” actually stands for “European bureau of investigation”. It’s a front of the EU secret service, collecting genomic info (via @klmr)
  26. There are only 3 facts in 101 (via @mcaccamo)
  27. If all you have is a hmmer everything looks like it can be resolved with Viterbi (via @mcaccamo)
  28. Hidden Markov Models are like the recipe for Kentucky Fried Chicken.  There are only three people in the world who understand small parts of how HMMs work, and only when they get together do they know the full picture
  29. The “e” in e-value stands for “excellent”, as in “that’s an excellent BLAST hit”
  30. The Burrows-Wheeler transform, used in BWA and Bowtie, saves memory by transforming the DNA sequence data into a parallel dimension, meaning it ceases to exist in 4D space/time in this Universe
  31. Base qualities are called “Phred” scores in honour of Fred Sanger who developed DNA sequencing. #101bioinfofunfacts (via @tostenseemann)
  32. In a recent public survey of the 100 most desirable jobs, bioinformatician was a close second to astronaut (via @dynomics)
  33. Heng Li writes all his code in x86 assembly language, and uses a C decompiler before releasing it. @lh3lh3 (via @torstenseemann)
  34. The EBI secretly funds the Perl Foundation to ensure its legacy internal software infrastructure won’t collapse (via @torstenseemann)
  35. Illumina reads are short as before the development of Basespace they were delivered via Twitter (via @RoyChaudhuri)
  36. Pet Bioinformaticians are paid with cuddling #101bioinfofunfacts (via )
  37. Python was conceived in the 1980’s by @gvanrossum & named after his favourite British comedy, Monty Python’s Flying Circus (via )
  38. the word “ELVIS” appears 35 times in human peps (GRch38). “ELVISLIVES” appears 0 times. The king has left the genome #slowday (via @rdemes)
  39. Tuxedo suit is so named that only ‘privileged’ know how to use it ! #bioinformaticsfun (via )
  40. It’s easy! You only have to download this database in which all the genes have only one ID and you can retrieve the IDs in the most important databases (via @jorjial)
  41. If you stand in front of a mirror and say ‘HiSeq’ 3 times, Illumina staff member will show up holding the HiSeq X Ten system (via @nazeetafatima)
  42. @BenLangmead wrote Bowtie while wearing a tuxedo but he did all the testing in zip-up onesie batman pajamas (via @coletrapnell)
  43. Spike-ins are like gold (via @nomad421)
  44. Do you need more hard disk space to store and do the analysis? Sure! Let’s buy 10 hard disk of 3 TB in the supermarket (via @jorjial)
  45. This could be the basis for 10.1 papers in PLOS Comp. Biol. (via @kbradnam)
  46. All bioinformatics problems can be solved through the medium of twitter, snide and ranting ;) (via @guyleonard)
  47. Installing TopHat with option –reverse will install HotTap, a program that spews vapid results on a random science hot topic (via @CamLBerthelot)
  48. SOLiD sequencers generated colour-space sequence using an algorithm based on the once popular “Simon Says” hand held game (via @iandcalling)
  49. CriMap was called CriMap because users do an awful lot of crying before they get a half decent map (via @dj_de_koning)
  50. A single anonymous donor, RP11, accounts for 72 percent of the human reference genome (via CanGenom)
  51. If you amass the de-bugging tears of a bioinformatician it is enough to fill an Olympic size swimming pool annually (via @paulhoskisson)
  52. FASTA 80 character line wrapping was invented to standardise data sharing using MS Word (via @IanGoodhead)
  53. nine out of ten Bioinformaticians prefer Excel (via
  54. if you’ve never shown the NIH sequencing costs plot in talk/lecture you’re not a real bioinformatician (via @AliciaOshlack)
  55. Illumina is short for Illuminati, the shadowy organisation that controls sequencing worldwide. (via @neilfws)
  56. Every time you run a closed source bioinformatics tool, a PhD student’s soul is sacrificed to the Blood God. (via @froggleston)
  57. The number of replicates needed for your RNA-seq experiment equals the impact factor of the journal you want to publish in (via @torstenseemann)
  58. NCBI’s bacterial annotation takes 6 weeks because it’s done manually by work experience students pasting ORFs into web BLAST (via @torstenseemann)
  59. The majority of bioinformaticians can’t pronounce “de Bruijn” properly (see also… @torstenseemann) (via @rvaerle)
  60. Oxford Nanopore plans to introduce a new FASTQ encoding scheme using an ASCII offset of 48 with optional emoji (via @torstenseemann)
  61. The HMMer package was so named when someone asked how it worked, and the developers said “Hmmmm… errr….” (via @mgollery)
  62. 63% of Bioinformaticists were Biologists to start with, but they realized that the cold room is really COLD! (via @mgollery)
  63. It has been calculated that there are twice as many data formats as there are Bioinformaticians (via @mgollery)

Complete Genomics Revolocity and the future of genome sequencing

I’m a bit late to this, please accept my apologies – I’ve been on holiday!  If you are into genome sequencing then you cannot have missed the announcement of the Revolocity system from Complete Genomics.  So what does this all mean?  Let’s put it all in context.


Key features appear to be:

  • The system costs $12M and occupies 1500 square feet.
  • Automated sequencing from biological samples, including blood and saliva
  • Capable of 10,000 WGS at 50X coverage.  To be increased to 30,000 WGS (no hint of how, though I expect read length to increase)
  • Capable of exomes
  • Has necessary compute and bioinformatics built in (possibly through BGI’s cloud service?)
  • 2x28bp reads apparently covering 96% of the human genome
  • 8 day sequencing time

It’s not just the sequencer, though….

It is a good thing that Illumina have a competitor in this space, as competition drives innovation and brings the cost of sequencing down.  The Revolocity is obviously a competitor for Illumina’s HiSeq X Ten and X Five systems.  However, Illumina have a few key advantages:

  • A history of supporting machines in the field (Complete have always been a sequence-as-a-service model)
  • An international support network of field engineers
  • An international distribution network for spare parts
  • A global production line for flowcells and reagents
  • The administrative support required for selling and supporting all of those things
  • A global network of scientists working with Illumina data for many years
  • A huge ecosystem of free and commercial bioinformatics software

CG will need to reproduce all of the above to genuinely compete.  A key piece of missing knowledge about Revolocity is: how often does it fail and how quickly will CG be able to fix it?  Illumina are not perfect in this, but they are a known entity.  Can Revolocity match or even beat Illumina’s “up time”?

Comparison to HiSeq X

CG have the advantage of automated sample prep from blood/saliva; 50X coverage; bioinformatics apparently included; exome enabled.

The HiSeq X also has automation tie-ins with the NeoPrep and their collaboration with Hamilton; they also have BaseSpace for bioinformatics needs.  So not too different from Revolocity.  We need more details to do a true comparison.

The HiSeq X wins on sequencing time (3 days) and read length (2x150bp).  The latter is key.  Despite CG claiming that they will cover 96% of the genome, figure 1 from this paper suggests that 56bp will get you about 92% of the human genome, whereas 300bp (HiSeq X) gets you about 97.5%

How will Illumina respond?

I have no doubt whatsoever that Illumina will reduce the cost of human genomes further; they will do this in January, at the JP Morgan conference.  I’m guessing a figure closer to $600-800 per genome.

I am less sure, but I think they may also open up the system to human exomes and RNA-Seq.


People have said Revolocity hits the $1000 genome mark, but I have yet to see evidence.  If they can, they are doing incredibly well.  However, we need to see more than press releases. Having been involved in the set-up of a HiSeq X system at Edinburgh Genomics, I can confidently say that the $1200 genome is genuinely achievable – by that I mean you provide us with DNA, we sequence to 30X, align, variant call and provide you with FASTQ, BAM/CRAM and GVCF, and it costs around $1200/£800.  If you are interested in Edinburgh Genomics’ service at that price, please get in touch!

Edit 14/06/2015

Revolocity cost was announced at ESHG:

The Future

I have spoken to a few people and we all agree – Illumina will remain market leader for at least 18-24 months.  That’s a long time in sequencing!  However, it is the PromethION that we think will disrupt Illumina, not Revolocity.  If the PromethION can first match and then surpass MinION’s quality levels, tied to massive increases in throughput, then it will become a serious competitor to Illumina.  Why will this not happen sooner?  It will take time for the new platform to develop, and yet more time for the market to react (a market heavily invested in Illumina).  If/when PromethION does become a threat, it will be fascinating to see how Illumina respond.

Analysing MinION data with no bioinformatics expertise nor infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice ;-)

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory

# load the necessary packages

# download one of Nick's runs and unpack it
url <- ""
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe get_fast5_info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <-

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    	fasta.2d <- get_fasta(f5)$'2D'
    # build a list of parameters for the EBI BLAST service
    	myl <- list(sequence=fasta.2d,

    # build the submission URL and submit the job
    runurl <- ""
    	jobid <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- ""
    	surl <- paste(staturl, jobid, sep="")
    	status <- "RUNNING"
    	while(status == "RUNNING") {
          		status <- getURL(surl)

    # build the output URL and fetch the results
    outurl <- ""
    	ourl <- paste(outurl, jobid, "/out", sep="")
    	out <- getURL(ourl)

    # write the output to a file in the current working directory
    	bloutfile <- gsub(".fast5",".blast",f5)
    	write(out, bloutfile)

    # print out the top hit
    	print(strsplit(out, "\n")[[1]][22])


Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:


You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:


Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!


How journals could “add value”

I wrote a piece for Genome Biology, you may have read it, about open science.  I said a lot of things in there, but one thing I want to focus on is how journals could “add value”.  As brief background: I think if you’re going to make money from academic publishing (and I have no problem if that’s what you want to do), then I think you should “add value”.  Open science and open access is coming: open access journals are increasingly popular (and cheap!), preprint servers are more popular, green and gold open access policies are being implemented etc etc. Essentially, people are going to stop paying to access research articles pretty soon – think 5-10 year time frame.

So what can journals do to “add value”?  What can they do that will make us want to pay to access them?  Here are a few ideas, most of which focus on going beyond the PDF:

  1. Live figures: This is something F1000Research are already doing, and there’s no reason other journals couldn’t do the same.  Wherever there is a figure, let readers interact with it – change the colours, axes, chart type etc etc
  2. Re-run analyses in a browser: I think this is something Gigascience have attempted, and would be an incredible training opportunity.  Let readers repeat the entire data analysis, in a browser, using e.g. Galaxy or iPython Notebook
  3. Re-run analyses from the command-line: as above, but provide a VM and a list of commands that readers can run to repeat the analysis
  4. Re-run analyses as an R package: imagine it – every paper has a downloadable R package, with the data included, that allows readers to explore the data within the wonderful statistical and visualisation environment that is R.
  5. Add/remove data – “what happens to graphs, tables and statistics if I remove these data points?”
  6. Show discarded data – so the data had outliers that were cleaned/removed?  Show them to me.  Where are they in the graphs?  Where are they in the tables?  Why were they discarded?
  7. What would my data look like? – basically, if the reader has a dataset which is similar to that analysed in the paper, the reader can upload that dataset (in agreed format) and see how the graphs, tables and stats change
  8. Enforce conventions – when you’re parsing that word document and converting to PDF, pick up gene names and automatically suggest community standards (e.g. HGNC)
  9. Enforce data submission (or do it for the authors).  Yes, do not publish unless the data are submitted to a public archive.  In fact, help the authors do it!
  10. Find similar data in…. – help readers find similar (public) datasets in public databases
  11. Actually check the statistics.  Yes you, the journal.  Most biologists aren’t qualified to check the statistics and experimental design, or do power analysis, so you do it.  Employ some statisticians!

OK, so I’m not pretending any of the above is easy, but I unsure why none of the above is happening – some publishers make HUGE PROFITS, why on earth have they not updated their product?  Imagine if Apple were still trying to sell version 1 of the iPod – no-one would buy it.  Most products get updated on a frequent basis to keep customers happy, yet journals have stuck with the PDF/Print version for decades.  Time for an update, especially if you want to keep the money flowing in.