Have fun with it!
Quite a few others have had their say on Oxford Nanopore’s MinION sequencer, and so I thought I would write down a few thoughts of my own. At Edinburgh Genomics, we’ve been working with the MinION since the first day of the MAP, and earlier this month we published one of the first bioinformatics tools to help users work with this platform: poRe. Whilst there have been low points in the process (mostly involving reagents or flowcells that don’t meet our and ONT’s high standards), we are still incredibly positive about the possibilities this platform offers.
Personally, I am still convinced that the future of DNA sequencing is nanopore based, but I’m getting ahead of myself. I want to make a few points.
Some perspective (part I)
Illumina are a $7 billion company who currently dominate the sequencing market, and over 90% of DNA sequencing data produced today comes from one of their platforms; Thermo Fisher, who own LifeTech, who own Ion Torrent, employ 50,000 people and had revenues of $13.1 billion in 2013. Even Pacific Biosciences, a minnow compared to the previous two, can be considered established, having launched their first sequencer in 2010 and with 100s of publications resulting from their technology.
Now consider Oxford Nanopore. A company that employs around 180 people, based in a science park on the outskirts of Oxford. A company that didn’t have a product up until 2014, when they launched the MinION to lucky collaborators. A small company with an exciting technology in a competitive market.
Firstly, if you expect ONT to behave like the other three companies I mentioned, then you must be insane.
Secondly, ONT have a disruptive technology that could, potentially, dominate the sequencing market for years to come. This puts ONT in a position of strength; but it also makes them vulnerable to attack. They must be very careful about how they operate, and they need to trust the people that they work with. I don’t understand what some people don’t get about that.
Some perspective (part II)
If you are currently sat in front of a bunch of MinION data wondering how on earth you’re ever going to make it do what you can do with your Illumina data, then go into your lab, take the MinION, put it back in the box and send it back to ONT. Seriously. Wondering how you can fit MinION into your existing workflows is like sitting in front of a space ship and wondering how you’re going to use it to commute to work; it’s like sitting in front of a time machine, and wondering if you could use it to get to the shops before they close. The potential applications of a mobile DNA sequencer are incredible.
You have in your possession, if you are lucky, the world’s first mobile DNA sequencer. It’s 4″ long. The fact it can produce data of any kind from DNA is a miracle. You are in possession of a miracle. If your mind isn’t ablaze with amazing applications, then give up on science and go home. Seriously.
I have nothing against Alexander Mikheyev or Mandy Tin, I don’t know them nor have I ever met them. I wish them every success in their future, I genuinely mean that. Alexander, Mandy, if you are reading this, I mean you no harm. But that paper is terrible. It’s just lazy. You ran the MinION a few times, got poor data, and… that’s it. That’s not something that should be written up into a paper. Seriously, what a waste of everyone’s time, yours and mine. If this is science, then I am depressed for science. Alexander, Mandy – I am 100% positive that you are better than this.
Some fuss has been made about the authors being kicked out of the MAP. Let me tell you about the MAP. The MAP is an amazing thing. What most sequencing companies do with a new tech, is they take it to the big genome centres (Sanger, Broad, WashU, BGI etc), they get down on their knees and they say “please make it work; please adopt it and say nice things about it”. This is one of the reasons why most new bioinformatics tech comes out of genome centres – they always get first glance at new sequencing technology, and so they’re in pole position when it comes to writing new algorithms to deal with it. MAP is different. I’m not saying they haven’t put MinION into Sanger, Broad etc – but they’ve also given it to lone scientists, to small groups, to medics, to public health, to vets. ONT have placed around 500 MinIONs in over 20 countries.
The MAP is about collaboration; it is about trust.
At no point have I doubted that ONT would let us publish data. At no point have I felt controlled, or restricted in what we could do. ONT simply want to know what MinION data look like in others’ hands, both experts and novices. In return for the platform, they want to see the data and understand why it looks the way it does. It’s a collaboration. It’s about give and take.
The “hit and run” paper of Mikheyev and Tin just doesn’t fit into that framework. Not in any way. Technically speaking, “self certification” is a statement to ONT that the platform is delivering data that are “good enough for the applications I want to use it for”. Mikheyev and Tin self-certified. Where are the applications then guys?
Don’t you think it was a bit dishonest to self-certify and then publish that paper?
Just to be clear – only two groups have left the MAP; one group who left voluntarily as they didn’t have enough time to devote to MinION, and Mikheyev and Tin. On the flip side, many additional people have been admitted.
Working with the data
Some very talented people in my lab have been working with the data, and I have been remarkably hands off. However, here are some of my opinions about the data:
- I don’t get the impression that generating ultra-long reads is going to be a problem. What goes through the pore is what’s in the sample, and if you have long fragments, you get long reads
- If you take the reads, align against a reference, you cover the whole genome, and can call the reference with 100% accuracy.
- You can call SNPs.
- Larger genome scaffolding is possible, and results are similar to scaffolding with PacBio.
- (yes I realise these are the boring questions I referred to above; the answer is “yes you can”; now it’s time to go dream of amazing applications)
- The quality needs to improve, base-calling needs to improve, throughput needs to improve – and I believe all of them will
- Error correction strategies look very promising and I think you will see papers on this in the next few months
A final word
I have respect for all sequencing companies and I believe all of them have something to offer. Illumina are amazing – talk about a company that delivers! They are about to enable routine medical genomics, and sequence entire countries. What they’ve done with their technology is incredible. Believe me, my pom poms still get used!
As I have said many times, I have huge respect for the way PacBio have turned themselves around. They were going nowhere just a few years ago, and now they are essential to most new genome sequencing efforts. Credit where it is due, though I do wonder if they have anything left in the tank.
Now there is ONT – a tantalising technology, their place in history is assured having produced the first mobile DNA sequencer.
Working with companies is a skill, I get that; a skill that many academics clearly lack. ONT have a great technology, and as a result, they are vulnerable to attack. As their collaborators, we have to recognise that and work with them, not against them. For example, it may be coincidence, but take a look at PacBio’s share price on 3rd September, the day Nick Loman gave a talk about the great things he is doing with ONT data. If those two events are related, then PacBio will be worried. Does anyone think they will just sit back and let ONT take their long-read crown?
The MinION is incredible. Nanopore sequencing is here and it’s here to stay, in my opinion. Don’t get me wrong, there’s a long way to go; lots of improvements need to happen. And they will. But the people that will make those improvements, both in the technology and in the bioinformatics algorithms to deal with the data, will be positive, forward-thinking people, people who approach science and data with optimism and an open mind.
Let’s be those people.
I wanted to tell the story of our new software, poRe, which is an R package written to help users deal with Oxford Nanopore’s MinION sequencer. We recently put the pre-print on bioRxiv, and it is getting plenty of attention. So let’s get into it!
The MinION device
Whichever way you look at it, the MinION is a revolutionary device. About 6 inches long, it plugs into the USB port of a Windows laptop (which is mandatory, to enable the proper running of the device). As you will know, the MinION measures single strands of DNA as they pass through protein nanopores, and is capable of ultra-long reads – over 100Kb have been reported.
It is very easy to imagine applications of this device which involve mobile sequencing – for example, picture a vet sat in a barn next to a sick animal, taking blood and sequencing DNA in real time to discover which pathogen is causing the illness.
However, as an early access user of the MinION, it became very clear that we would need to develop some software to begin to approach this vision. The MinION is clearly aimed at non-experts, biologists who want to use it for rapid sequencing “in the field”. To enable this, we need to give them software they can use, and that software needs to work on the Windows laptop itself. Relying on uploading the data to a server and analysing it there is not a solution that can work for “in the field” sequencing (people have said to me that you can access the internet anywhere now; they are often city folk, who don’t realise that 3G/4G coverage doesn’t extend to huge swathes of the countryside, never mind 3rd world countries that don’t have the infrastructure).
The MinION workflow
Once a sample has been applied to the MinION, and DNA molecules pass through nanopores on the flowcell, data collection begins. If a channel passes a number of QC metrics, each sequence read is written to a file. Due to a hairpin adapter, each DNA molecule can be read twice, termed “template” and “complement” reads. Whether the molecule is read once or twice, the raw data are written to a file on the laptop’s hard disk drive.
The directory holding these files is monitored by an agent, and once new files are discovered, they are queued for upload to metrichor, a cloud-based base-caller which takes the raw data and calls the nucleotide sequence. The base-called files are downloaded and stored in a sub-folder.
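To make the monitoring step concrete, here is a minimal Python sketch of the general idea: poll a directory and queue any files that have not been seen before. This is not ONT's agent and has nothing to do with how metrichor is actually called; the function names and the polling approach are my own assumptions for illustration, and the "upload" is just a list.

```python
import os
import tempfile


def find_new_files(directory, seen, suffix=".fast5"):
    """Return names of files in `directory` ending in `suffix` that are not in `seen`.

    `seen` is a set of file names; it is updated in place so that the next
    poll only reports files that have appeared since this one.
    """
    current = {name for name in os.listdir(directory) if name.endswith(suffix)}
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


def demo():
    """Simulate two polls of a data directory and return the resulting queue."""
    queue = []      # stand-in for an upload queue (hypothetical)
    seen = set()
    with tempfile.TemporaryDirectory() as run_dir:
        # Files present at the first poll; notes.txt should be ignored.
        for name in ("read_1.fast5", "read_2.fast5", "notes.txt"):
            open(os.path.join(run_dir, name), "w").close()
        queue.extend(find_new_files(run_dir, seen))

        # A new read arrives before the second poll.
        open(os.path.join(run_dir, "read_3.fast5"), "w").close()
        queue.extend(find_new_files(run_dir, seen))
    return queue
```

In practice the poll would sit in a loop with a sleep between iterations; keeping the "what is new?" logic in its own function makes it easy to check without waiting on a timer.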
The organisation challenge
Here the challenge begins. All files produced by the MinION (called .fast5 files) are in HDF5 format, a binary hierarchical data format. These files require specialised software to open. In addition, the MinION dumps all data files from multiple runs into a single directory, embedding run information within each file. Each run can produce 10,000-30,000 files, and therefore very quickly users are presented with a directory with 100,000s of files in it and no way of organising them into run folders. Finally, there have been and continue to be multiple versions of metrichor (the base-caller). Therefore each data file can be base-called multiple times. This is quite a large and complex data set.
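The organisation problem described above boils down to "work out which run each file belongs to, then move it into a folder for that run". Below is a rough Python sketch of that idea. It is not how poRe does it: the run identifier really lives inside the HDF5 file and needs an HDF5 reader to get at, so here the lookup is passed in as a function, and `run_id_from_name` is a purely hypothetical stand-in that pretends the run ID is the start of the file name.

```python
import os
import shutil
import tempfile


def organise_by_run(directory, get_run_id, suffix=".fast5"):
    """Move each data file in `directory` into a sub-folder named after its run.

    `get_run_id` is a caller-supplied function mapping a file path to a run
    identifier (in real life it would read the run information embedded in
    the file). Returns a dict of run ID -> list of file names moved.
    """
    moved = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and name.endswith(suffix)):
            continue  # skip sub-folders and unrelated files
        run_id = get_run_id(path)
        run_dir = os.path.join(directory, run_id)
        os.makedirs(run_dir, exist_ok=True)
        shutil.move(path, os.path.join(run_dir, name))
        moved.setdefault(run_id, []).append(name)
    return moved


def run_id_from_name(path):
    """Hypothetical stand-in: treat the text before the first '_' as the run ID."""
    return os.path.basename(path).split("_")[0]


def demo():
    """Create a few empty files from two pretend runs and organise them."""
    with tempfile.TemporaryDirectory() as data_dir:
        for name in ("runA_read1.fast5", "runB_read1.fast5", "runA_read2.fast5"):
            open(os.path.join(data_dir, name), "w").close()
        return organise_by_run(data_dir, run_id_from_name)
```

The same pattern would extend to a second level of sub-folders keyed on the base-caller version, by having the lookup function return both values.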
The data extraction challenge
Lots of information is embedded within each .fast5 file, including run metrics and the nucleotide sequences that most people are interested in, and these data can only be extracted programmatically using the HDF5 library. This represents a challenge for bioinformaticians, never mind biologists (or vets!).
As a package for R, poRe runs on Windows and has an incredibly simple installation path (just two additional libraries need to be installed manually). poRe enables users to organise the MinION directory into run folders, and use the version of the metrichor base-caller as a sub-folder. poRe also allows extraction of the sequence data as fastq, and collects the run statistics into an easy-to-use data.frame within R. A number of plots are built in, and of course users can plot their own graphs if they wish. poRe has also been tested on Linux, and at least one user is using it on Mac.
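To give a flavour of what "collecting run statistics" means once the sequences are out as fastq, here is a small Python sketch that parses fastq text and summarises read lengths. It is an illustration only and not poRe's code or output format; the function names and the particular statistics chosen are my own assumptions, and it starts from fastq text rather than from the .fast5 files themselves.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records from a string into (name, sequence, quality) tuples."""
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        # A truncated final record makes this unpacking raise ValueError.
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %s" % header)
        records.append((header[1:], seq, qual))
    return records


def summarise(records):
    """Return simple read-length statistics for a list of parsed records."""
    lengths = [len(seq) for _, seq, _ in records]
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bases": total,
        "longest": max(lengths) if lengths else 0,
        "mean_length": total / len(lengths) if lengths else 0.0,
    }


# Two made-up reads, purely for illustration.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n"
STATS = summarise(parse_fastq(EXAMPLE))
```

A table of one row per read (name, length, and so on) is the natural next step, which is roughly the role the data.frame plays in R.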
Edited 30/07/2014 – I got confused about who wrote what and when – apologies to Embriette and Jonathan
There’s a lot of discussion about Neil Hall’s paper on the Kardashian index, and this is classic Neil stuff – funny and provocative. I’m not going to discuss this paper directly; enough people are doing it already.
Over on the microBEnet blog, Embriette Hyde states:
A high K index indicates that said scientist may have built their reputation on a shaky foundation (i.e. the Kardashians, circled in Figure from the paper below), while a low K index indicates that said scientist is not being given credit where credit is due.
I think we’re coming dangerously close to making judgements about quality, and what we’re actually measuring is citations – and they’re not the same thing!
For me, the number of citations a particular paper gets is based on:
- The size of the field e.g. if you work in cancer (a large field), your paper has a higher probability of being cited than if you work on an obscure fungus that only grows in Lithuanian coal mines (probably a very small field, but I may be wrong).
- Given the size of the field, next comes the importance of your work. So, sticking with the cancer example, you may work on a high-prevalence cancer, such as lung cancer (70,000 deaths last year in the UK), or a low-prevalence one, such as cancer of the penis (111 deaths in the UK last year)
- Then, given the size of the field, and the importance of the work, comes relevance. Perhaps you discovered an oncogene that affects 60% of the population; alternatively you have found an oncogene that affects 2% of a small tribe from the Peruvian rainforest.
- Finally comes the quality of your work. It is quite easy to imagine someone who does incredibly high quality work in a niche area getting far fewer citations than someone who does really crappy work in a larger, more important field
To be fair to Neil, and Embriette, neither directly states that citations = quality; in fact Neil deliberately uses the phrase “scientific value”, which is ill defined.
However, the fact still remains that what Neil is implying is that scientists with low numbers of citations are somehow less than those with lots of citations. And that’s just not right.
And Embriette’s wording implies that scientists with low citations (and high numbers of Twitter followers) are on “shaky foundations”. I say the assumptions behind that wording are on shaky foundations!
Some really very high quality work never gets cited (we can argue another time about the value of such research). And some really quite shoddy work gets huge numbers of citations. So I get to have the last word: number of citations is not a measure of quality.
“Ah welcome! Come in, come in!” said the institute director as Professor Smith appeared for their scheduled 2pm meeting. “I want to talk to you about your latest proposal”, the director continued.
“Oh?” replied Smith.
“Yes. Now, let’s see. It’s an amazing, visionary proposal, a great collaboration, and congratulations on pulling it together. I just have one question” said the director “This proposal will generate a huge amount of data – how do you plan to deal with it all?”
“Oh that’s easy!” answered Smith. “It’s all on page 6. We’ve requested funds to employ a bioinformatician for the lifetime of the project. They’ll deal with all of the data” he stated, triumphantly.
The director frowned.
“I see. Do you yourself have any experience of bioinformatics?”
Smith seemed uncertain.
“Then how will you be able to guide the bioinformatician, to ensure they are using appropriate tools? How will you train them?” the director pressed.
Smith appeared perplexed by the question.
“We’ll employ someone who has already been trained, with at least a Masters in bioinformatics! They should already know what they’re doing…” Smith trailed off.
The director sighed.
“And what papers will the bioinformatician publish?”
Smith regained some confidence.
“They’ll get co-authorship on all of the papers coming out of the project. The post-docs who do the work will be first author, I will be last author and the bioinformatician will be in the middle”
The director drummed his fingers on his desk.
“What about a data management plan?”
“A data management plan. A plan, to manage the data. Where will it be stored? How will it be backed up? When will it be released?” the director asked.
“Same as always, I guess” said Smith. “We’ll release supporting data as supplementary PDFs, and we’ll make sure we get every last publication we possibly can before releasing the full data set”
The director shifted uneasily in his seat. “And data storage?”
“Don’t IT deal with that kind of stuff?” Smith answered.
An awkward silence settled over the office. The director stared at Professor Smith. Finally he broke the silence.
“OK, so you have this bioinformatician, you give them the data, and they analyse it and they give you the results. How will you ensure that they’ve carried out reproducible science?”
“Reproducible what? What the hell are you talking about?” Smith answered angrily.
The director slammed his hand down on the desk.
“At least tell me you have a plan for dealing with the sequence data!”
“Of course!” said Smith “We’ve been doing this for years. We’ll keep the sequences in Word documents….”
an amber light started flashing on the director’s desk
“… annotate genes by highlighting the sequence in blue…”
the flashing light turned red
“… annotate promoters by highlighting the sequence in orange…”
Smith’s sentence was interrupted by a noisy klaxon suddenly going off, accompanied by a bright blue flashing light that had popped up behind the director’s chair. Smith looked wide-eyed, terrified.
The director pressed a few buttons on his desk and the noisy alarm ceased, the blue light disappeared.
Smith, removing his hands from his ears, asked “What the hell was that?”
The director stood, walked over to the window and sighed heavily. “I’m sorry, Smith. I had a feeling this might happen. Look… this may appear harsh, but… you’re not allowed bioinformatics anymore”
“As I said. You’ve crossed the threshold. You’re not allowed bioinformatics anymore”
Smith’s mouth flapped open and shut as he tried to take in the news.
“You mean no-one will analyse my data?”
The director turned to face Smith.
“Quite the contrary, Smith. Good data will always be welcome, and yours will be treated no differently. It’s just that you won’t be in charge of the storage and analysis of it anymore. You can generate the data, but that will be the end of your involvement. The data will be passed to a bioinformatics group who know what to do with it.”
Smith was furious.
“Are you insane? That’s my data! I can do whatever I like with it! Bioinformaticians won’t know what to do with it anyway!”
“On the contrary” replied the director “It’s not your data. Your research is funded by the government, which is in turn funded by the tax payer. The data belong in the public domain. As for bioinformaticians, they’re scientists too and they’ll be able to analyse your data just as well as you can, probably better”
“I’ve never heard anything so ridiculous! Who decided that I’m not allowed bioinformatics anymore?”
“The Universe? Why should the Universe say I’m not allowed bioinformatics anymore?”
“Because you haven’t paid bioinformatics enough attention. It’s not a support service, at your beck and call. It’s a science. Bioinformaticians are scientists too. Young bioinformaticians need support, guidance and training; something you’re clearly not qualified to provide. They also need first-author papers to advance their careers”
“I don’t understand. What do you mean, they’re not support?!” spluttered Smith.
The director continued regardless of the interruption.
“You’ve had the opportunity to learn about bioinformatics. We’ve had a bioinformatics research group at the institute for over ten years, yet you only ever speak to them at the end of a project when you’ve already generated the data and need their help!”
“The bioinformatics group?! They’re just a bunch of computer junkies!”
The director was beginning to get angry.
“Quite the opposite. They publish multiple research papers every year, and consistently bring in funding. More than your group, actually”.
Smith looked stunned.
“But, but, but… how can this be possible? You’ll never get away with this!”
“I’m afraid I can and I will” said the director. “Science has changed, Smith. It’s a brave new world out there. Bioinformatics is key to the success of many major research programmes and bioinformaticians are now driving those programmes. Those researchers who embrace bioinformatics as a new and exciting science will be successful and those that don’t will be left behind.”
The director stared pointedly at Professor Smith. Smith was defeated, but still defiant.
“It doesn’t matter. We have tons of data we haven’t published yet. I’ll be able to work on that for decades! I don’t need new data, I have plenty of existing data”.
A smile flittered at the corners of the director’s mouth.
“Here’s the thing, Smith. As soon as that alarm went off, all of your data were zipped into a .tar.gz archive and uploaded to the cloud. It’s no longer in your possession”.
Smith looked horrified.
“What’s the cloud? How do I access it? What is a .tar.gz file and how do I open it?”
“You know” said the director “keep asking questions like that, and you might get bioinformatics back”
If you are leading a project that creates huge amounts of data, instead of employing a bioinformatician in your own group, why not collaborate with an existing bioinformatics group and fund a post there? The bioinformatician will benefit hugely from being around more knowledgeable computational biologists, and will still be dedicated to your project.
The above was hugely inspired by “Ballantyne T (2012) If only … Nature 489(7414):170-170”. I hope Tony doesn’t mind.
Titus Brown has written a somewhat idealistic post on replicable bioinformatics papers, so I thought I would write some of my ideas down too :-)
1. Create a new folder to hold all of the results/analysis. Probably the best thing to do is name it after yourself, so try “Dave” or “Kelly”. If you already have folders with those names, just add an index number e.g. “Dave378” or “Kelly5142”
2. Put all of your analysis in a Perl script. Call this analysis.pl. If you need to update this script, don’t add a version number, simply call these new scripts “newanalysis.pl”, “latestanalysis.pl”, “newnewanalysis.pl”, “newestanalysis.pl” etc etc
3. Place a README in the directory. Don’t put anything in the README. Your PI will simply check that the README is there and be satisfied you are doing reproducible research. They won’t ever read the README.
4. Write the paper in Word, called “paper.docx”. Send it round to all co-authors, asking them to turn on track changes. Watch in horror as 500 different versions come back to you, called things like “paper_edited.docx”, “paper_mw.docx”, “paper_new.docx” etc etc. Open each one to see that it now looks like Salvador Dalí had an epileptic fit in a paint factory.
5. When reviewer comments come back 6 months later asking for some small detail to be changed, have a massive panic attack as you realise you have no idea how you did any of it. Start the whole analysis again, in a new folder (“Dave379” or “Kelly5143”) and pray to God that you somehow miraculously come up with the same results and figures.
6. After the paper has been accepted, and the copy editor insists that all figures are 1200 dpi, first look up dpi so you know what it means, and then wrestle with R’s png() and jpeg() functions. Watch as your PC grinds away for 300 hours to produce a scatterplot that, in area, is roughly the size of Russia and comes in at 30TB. Attempts to open it in an image viewer crash your entire network.
7. Weep silently with joy when someone tells you about ImageMagick, or that the journal will accept PDF images.
8. Upon publication, forget any of this ever happened.
Some time ago, I published a post called “A guide for the lonely bioinformatician” – this turned out to be one of my most popular posts, and has over 10,000 views to date. Whilst I wrote that post to try and help those that find themselves as lone bioinformaticians in wet labs, that wasn’t initially my main motivation; at first, my main motivation had been panic – panic at the amount of bad science that lone bioinformaticians, without support, might produce.
Let me be clear, this isn’t the fault of the lone bioinformaticians themselves – any young scientist working in isolation will make mistakes – it is the fault of the PIs and heads of labs who employ said lone bioinformaticians with no cogent plan on how to support them.
You may get a sense of my motivation from the post itself:
I’ve seen more than one project where the results were almost 100% crap because a bioinformatician acted in isolation and didn’t ask for help
Then yesterday I had this conversation on Twitter:
To summarise, we started with a clinical lab saying, quite rightly, that it is hard to recruit bioinformaticians; there were then many comments about how labs often want to employ people with rare and in-demand skills (so-called “bioinformatics unicorns”) on poor salaries or boring projects, and that’s why it is difficult to recruit.
I agree with this, but that’s not the point I want to make here.
Many of you will be ahead of me at this point, but let me spell it out. Lone bioinformaticians will make mistakes, often elementary mistakes, because they don’t have peer support or access to an expert in bioinformatics who can help them. This matters less in research labs investigating e.g. the evolution of sea squirts, but clinical labs deal with data that can actually affect a patient’s health and/or treatment.
I am aware of a few lone bioinformaticians working in clinical labs. I want to make this clear – this is a bad idea. In fact, it’s a terrible idea. Through no fault of their own, these guys will make mistakes. Those mistakes may have dire consequences if the data are then used to inform a treatment plan or diagnosis.
So what’s the solution? I can think of two:
- Pay more and employ more senior bioinformaticians who know what they’re doing, and build a team around those experienced bioinformaticians
- Collaborate with a bioinformatics group experienced in human genomics/genetics
To any lone bioinformaticians working in clinical labs, I would say this: find support; find help; make sure the things you are doing are the right things; have your pipelines reviewed by an independent external expert. Don’t be alone. This isn’t a personal attack on you – every young scientist makes mistakes, I certainly did – you need support and it’s important you get it.
To the clinical labs: I understand there are funding issues. This isn’t a personal attack on you either. But employing a lone (and inexperienced) bioinformatician will almost certainly result in mistakes being made that would have been avoided by someone more experienced. Please consider the two options above.