poRe: an R package for the visualization and analysis of nanopore sequencing data

I wanted to tell the story of our new software, poRe, which is an R package written to help users deal with Oxford Nanopore’s MinION sequencer.  We recently put the pre-print on bioRxiv, and it is getting plenty of attention.  So let’s get into it!

The MinION device

Whichever way you look at it, the MinION is a revolutionary device.  About 6 inches long, it plugs into the USB port of a Windows laptop (which is mandatory, to enable the proper running of the device). As you will know, the MinION measures single strands of DNA as they pass through protein nanopores, and is capable of ultra-long reads – over 100Kb have been reported.

It is very easy to imagine applications of this device which involve mobile sequencing – for example, picture a vet sat in a barn next to a sick animal, taking blood and sequencing DNA in real time to discover which pathogen is causing the illness.

However, as early access users of the MinION, we quickly realised that we would need to develop some software to begin to approach this vision.  The MinION is clearly aimed at non-experts, biologists who want to use it for rapid sequencing “in the field”.  To enable this, we need to give them software they can use, and that software needs to work on the Windows laptop itself.  Relying on uploading the data to a server and analysing it there is not a solution that can work for “in the field” sequencing (people have said to me that you can access the internet anywhere now;  they are often city folk, who don’t realise that 3G/4G coverage doesn’t extend to huge swathes of the countryside, never mind 3rd world countries that don’t have the infrastructure).

The MinION workflow

Once a sample has been applied to the MinION, and DNA molecules pass through nanopores on the flowcell, data collection begins.  If a channel passes a number of QC metrics, each sequence read is written to a file.  Due to a hairpin adapter, each DNA molecule can be read twice, termed “template” and “complement” reads.  Whether the molecule is read once or twice, the raw data are written to a file on the laptop’s hard disk drive.

This directory is monitored by an agent, and once new files are discovered, they are queued for upload to metrichor, a cloud-based base-caller which takes the raw data and calls the nucleotide sequence.  The base-called files are downloaded and stored in a sub-folder.

The organisation challenge

Here the challenge begins.  All files produced by the MinION are in HDF5 format, a binary hierarchical data format.  These files require specialised software to open.  In addition, the MinION dumps all data files from multiple runs into a single directory, embedding run information within the HDF5 files themselves (these are the .fast5 files).  Each run can produce 10,000-30,000 files, and therefore very quickly users are presented with a directory containing 100,000s of files and no way of organising them into run folders.  Finally, there have been and continue to be multiple versions of metrichor (the base-caller). Therefore each data file can be base-called multiple times.  This is quite a large and complex data set.

The data extraction challenge

Lots of information is embedded within each .fast5 file, including run metrics and the nucleotide sequences that most people are interested in, and these data can only be extracted programmatically using the HDF5 library.  This represents a challenge for bioinformaticians, never mind biologists (or vets!).

poRe

As a package for R, poRe runs on Windows and has an incredibly simple installation path (just two additional libraries need to be installed manually).  poRe enables users to organise the MinION directory into run folders, and use the version of the metrichor base-caller as a sub-folder.  poRe also allows extraction of the sequence data as fastq, and collects the run statistics into an easy-to-use data.frame within R.  A number of plots are built in, and of course users can plot their own graphs if they wish.  poRe has also been tested on Linux, and at least one user is using it on Mac.
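
To make the organising step concrete, here is a minimal sketch of the idea in Python.  It is not poRe's actual code (poRe is written in R), and the names organise_reads and describe_from_name are made up for illustration.  In real use the describe function would have to open each .fast5 file with an HDF5 library and read the run ID and base-caller version from inside it; the stand-in below simply assumes both are encoded in the file name, which is not how the MinION names its files.

```python
import shutil
from pathlib import Path
from typing import Callable, Tuple


def describe_from_name(fast5: Path) -> Tuple[str, str]:
    """Hypothetical stand-in: pretend files are named <run>_<version>_<read>.fast5.

    A real implementation would read the run ID and base-caller version
    from the metadata stored inside the HDF5 file instead.
    """
    run_id, version = fast5.stem.split("_")[:2]
    return run_id, version


def organise_reads(
    source: Path,
    dest: Path,
    describe: Callable[[Path], Tuple[str, str]] = describe_from_name,
) -> int:
    """Move every .fast5 file in `source` into dest/<run_id>/<version>/.

    Returns the number of files moved. Other files are left untouched.
    """
    moved = 0
    # sorted() builds a full list first, so moving files while looping is safe.
    for fast5 in sorted(source.glob("*.fast5")):
        run_id, version = describe(fast5)
        target_dir = dest / run_id / version
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(fast5), str(target_dir / fast5.name))
        moved += 1
    return moved
```

Passing the lookup in as a function keeps the file shuffling separate from the HDF5 reading, so the same loop would work whichever library is used to open the files.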

Check it out!  Tutorial here; software here.

 

poRe screenshots

Citations are not a measure of quality

Edited 30/07/2014 – I got confused about who wrote what and when – apologies to Embriette and Jonathan

There’s a lot of discussion about Neil Hall’s paper on the Kardashian index, and this is classic Neil stuff – funny and provocative.  I’m not going to discuss this paper directly, enough are doing it already.

Over on the microBEnet blog, Embriette Hyde states:

A high K index indicates that said scientist may have built their reputation on a shaky foundation (i.e. the Kardashians, circled in Figure from the paper below), while a low K index indicates that said scientist is not being given credit where credit is due.

I think we’re coming dangerously close to making judgements about quality, and what we’re actually measuring is citations – and they’re not the same thing!

For me, the number of citations a particular paper gets is based on:

  1. The size of the field e.g. if you work in cancer (a large field), your paper has a higher probability of being cited than if you work on an obscure fungus that only grows in Lithuanian coal mines (probably a very small field, but I may be wrong).
  2. Given the size of the field, next comes the importance of your work.  So, sticking with the cancer example, you may work on a high prevalence cancer, such as lung cancer (70,000 deaths last year in the UK), or on a rare one, such as cancer of the penis (111 deaths in the UK last year)
  3. Then, given the size of the field, and the importance of the work, comes relevance.  Perhaps you discovered an oncogene that affects 60% of the population; alternatively you have found an oncogene that affects 2% of a small tribe from the Peruvian rainforest.
  4. Finally comes the quality of your work.  It is quite easy to imagine someone who does incredibly high quality work in a niche area getting far fewer citations than someone who does really crappy work in a larger, more important field.

To be fair to Neil and Embriette, neither directly states that citations = quality; in fact, Neil deliberately uses the phrase “scientific value”, which is ill defined.

However, the fact still remains that what Neil is implying is that scientists with low numbers of citations are somehow less than those with lots of citations.  And that’s just not right.

And Embriette’s wording implies that scientists with low citations (and high numbers of Twitter followers) are on “shaky foundations”.  I say the assumptions behind that wording are on shaky foundations!

Some really very high quality work never gets cited (we can argue another time about the value of such research).  And some really quite shoddy work gets huge numbers of citations.  So I get to have the last word:  number of citations is not a measure of quality.

 

You’re not allowed bioinformatics anymore

“Ah welcome! Come in, come in!” said the institute director as Professor Smith appeared for their scheduled 2pm meeting. “I want to talk to you about your latest proposal”, the director continued.

“Oh?” replied Smith.

“Yes. Now, let’s see. It’s an amazing, visionary proposal, a great collaboration, and congratulations on pulling it together. I just have one question” said the director “This proposal will generate a huge amount of data – how do you plan to deal with it all?”

“Oh that’s easy!” answered Smith. “It’s all on page 6. We’ve requested funds to employ a bioinformatician for the lifetime of the project. They’ll deal with all of the data” he stated, triumphantly.

The director frowned.

“I see. Do you yourself have any experience of bioinformatics?”

Smith seemed uncertain.

“Well, no…..”

“Then how will you be able to guide the bioinformatician, to ensure they are using appropriate tools? How will you train them?” the director pressed.

Smith appeared perplexed by the question.

“We’ll employ someone who has already been trained, with at least a Masters in bioinformatics! They should already know what they’re doing…” Smith trailed off.

The director sighed.

“And what papers will the bioinformatician publish?”

Smith regained some confidence.

“They’ll get co-authorship on all of the papers coming out of the project. The post-docs who do the work will be first author, I will be last author and the bioinformatician will be in the middle”

The director drummed his fingers on his desk.

“What about a data management plan?”

“A what?”

“A data management plan. A plan, to manage the data. Where will it be stored? How will it be backed up? When will it be released?” the director asked.

“Same as always, I guess” said Smith. “We’ll release supporting data as supplementary PDFs, and we’ll make sure we get every last publication we possibly can before releasing the full data set”

The director shifted uneasily in his seat. “And data storage?”

“Don’t IT deal with that kind of stuff?” Smith answered.

An awkward silence settled over the office. The director stared at Professor Smith. Finally he broke the silence.

“OK, so you have this bioinformatician, you give them the data, and they analyse it and they give you the results. How will you ensure that they’ve carried out reproducible science?”

“Reproducible what? What the hell are you talking about?” Smith answered angrily.

The director slammed his hand down on the desk.

“At least tell me you have a plan for dealing with the sequence data!”

“Of course!” said Smith “We’ve been doing this for years. We’ll keep the sequences in Word documents….”

an amber light started flashing on the director’s desk

“… annotate genes by highlighting the sequence in blue…”

the flashing light turned red

“… annotate promoters by highlighting the sequence in orange…”

Smith’s sentence was interrupted by a noisy klaxon suddenly going off, accompanied by a bright blue flashing light that had popped up behind the director’s chair.  Smith looked wide-eyed, terrified.

The director pressed a few buttons on his desk and the noisy alarm ceased, the blue light disappeared.

Smith, removing his hands from his ears, asked “What the hell was that?”

The director stood, walked over to the window and sighed heavily. “I’m sorry, Smith. I had a feeling this might happen. Look… this may appear harsh, but… you’re not allowed bioinformatics anymore”

“What?”

“As I said. You’ve crossed the threshold. You’re not allowed bioinformatics anymore”

Smith’s mouth flapped open and shut as he tried to take in the news.

“You mean no-one will analyse my data?”

The director turned to face Smith.

“Quite the contrary, Smith. Good data will always be welcome, and yours will be treated no differently. It’s just that you won’t be in charge of the storage and analysis of it anymore. You can generate the data, but that will be the end of your involvement. The data will be passed to a bioinformatics group who know what to do with it.”

Smith was furious.

“Are you insane? That’s my data! I can do whatever I like with it! Bioinformaticians won’t know what to do with it anyway!”

“On the contrary” replied the director “It’s not your data. Your research is funded by the government, which is in turn funded by the tax payer. The data belong in the public domain. As for bioinformaticians, they’re scientists too and they’ll be able to analyse your data just as well as you can, probably better”

“I’ve never heard anything so ridiculous! Who decided that I’m not allowed bioinformatics anymore?”

“The Universe.”

“The Universe? Why should the Universe say I’m not allowed bioinformatics anymore?”

“Because you haven’t paid bioinformatics enough attention. It’s not a support service, at your beck and call. It’s a science. Bioinformaticians are scientists too. Young bioinformaticians need support, guidance and training; something you’re clearly not qualified to provide. They also need first-author papers to advance their careers”

“I don’t understand. What do you mean, they’re not support?!” spluttered Smith.

The director continued regardless of the interruption.

“You’ve had the opportunity to learn about bioinformatics. We’ve had a bioinformatics research group at the institute for over ten years, yet you only ever speak to them at the end of a project when you’ve already generated the data and need their help!”

“The bioinformatics group?! They’re just a bunch of computer junkies!”

The director was beginning to get angry.

“Quite the opposite. They publish multiple research papers every year, and consistently bring in funding. More than your group, actually”.

Smith looked stunned.

“But, but, but… how can this be possible? You’ll never get away with this!”

“I’m afraid I can and I will” said the director. “Science has changed, Smith. It’s a brave new world out there. Bioinformatics is key to the success of many major research programmes and bioinformaticians are now driving those programmes. Those researchers who embrace bioinformatics as a new and exciting science will be successful and those that don’t will be left behind.”

The director stared pointedly at Professor Smith. Smith was defeated, but still defiant.

“It doesn’t matter. We have tons of data we haven’t published yet. I’ll be able to work on that for decades! I don’t need new data, I have plenty of existing data”.

A smile flittered at the corners of the director’s mouth.

“Here’s the thing, Smith. As soon as that alarm went off, all of your data were zipped into a .tar.gz archive and uploaded to the cloud. It’s no longer in your possession”.

Smith looked horrified.

“What’s the cloud? How do I access it? What is a .tar.gz file and how do I open it?”

“You know” said the director “keep asking questions like that, and you might get bioinformatics back”


If you are leading a project that creates huge amounts of data, instead of employing a bioinformatician in your own group, why not collaborate with an existing bioinformatics group and fund a post there? The bioinformatician will benefit hugely from being around more knowledgeable computational biologists, and will still be dedicated to your project.


The above was hugely inspired by “Ballantyne T (2012) If only … Nature 489(7414):170-170”.  I hope Tony doesn’t mind.

 

How not to make your papers replicable

Titus Brown has written a somewhat idealistic post on replicable bioinformatics papers, so I thought I would write some of my ideas down too :-)

1. Create a new folder to hold all of the results/analysis.  Probably the best thing to do is name it after yourself, so try “Dave” or “Kelly”.  If you already have folders with those names, just add an index number e.g. “Dave378” or “Kelly5142”

2. Put all of your analysis in a Perl script.  Call this analysis.pl.  If you need to update this script, don’t add a version number, simply call these new scripts “newanalysis.pl”, “latestanalysis.pl”, “newnewanalysis.pl”, “newestanalysis.pl” etc etc

3. Place a README in the directory.  Don’t put anything in the README.  Your PI will simply check that the README is there and be satisfied you are doing reproducible research.  They won’t ever read the README.

4. Write the paper in Word, called “paper.docx”.  Send it round to all co-authors, asking them to turn on track changes.  Watch in horror as 500 different versions come back to you, called things like “paper_edited.docx”, “paper_mw.docx”, “paper_new.docx” etc etc.  Open each one to see that it now looks like Salvador Dalí had an epileptic fit in a paint factory.

5. When reviewer comments come back 6 months later asking for some small detail to be changed, have a massive panic attack as you realise you have no idea how you did any of it.  Start the whole analysis again, in a new folder (“Dave379” or “Kelly5143”) and pray to God that you somehow miraculously come up with the same results and figures.

6. After the paper has been accepted, and the copy editor insists that all figures are 1200 dpi, first look up dpi so you know what it means, and then wrestle with R’s png() and jpeg() functions.  Watch as your PC grinds away for 300 hours to produce a scatterplot that, in area, is roughly the size of Russia and comes in at 30Tb.  Attempts to open it in an image viewer crash your entire network.

7. Weep silently with joy when someone tells you about ImageMagick, or that the journal will accept PDF images.

8. Upon publication, forget any of this ever happened.
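
For anyone genuinely stuck at step 6: the pixel size of an image is just its printed size in inches multiplied by the dpi, so a sensible journal figure at 1200 dpi is large but nowhere near the size of Russia.  The Python sketch below only does that arithmetic; the function names are my own and the 3 bytes per pixel figure assumes plain 24-bit RGB with no compression.  In R itself, my understanding is that png() takes width, height, units and res arguments, so asking for inches with res = 1200 should give the same pixel counts, but do check the help page.

```python
from typing import Tuple


def pixel_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a figure printed at width_in x height_in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_megabytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> float:
    """Raw bitmap size in MB (10**6 bytes), assuming 24-bit RGB by default."""
    return width_px * height_px * bytes_per_pixel / 1_000_000


# A 7 x 7 inch scatterplot at 1200 dpi.
width_px, height_px = pixel_size(7, 7, 1200)
print(width_px, height_px)                           # 8400 8400
print(uncompressed_megabytes(width_px, height_px))   # about 211.68 MB before compression
```

If the numbers coming out of your own script are orders of magnitude bigger than this, the usual culprit is a width given in pixels being multiplied by the dpi a second time.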

The lonely bioinformatician revisited: clinical labs

Some time ago, I published a post called “A guide for the lonely bioinformatician” – this turned out to be one of my most popular posts, and has over 10,000 views to date.  Whilst I wrote that post to try and help those that find themselves as lone bioinformaticians in wet labs, that wasn’t initially my main motivation; at first, my main motivation had been panic – panic at the amount of bad science that lone bioinformaticians, without support, might produce.

Let me be clear, this isn’t the fault of the lone bioinformaticians themselves – any young scientist working in isolation will make mistakes – it is the fault of the PIs and heads of labs who employ said lone bioinformaticians with no cogent plan on how to support them.

You may get a sense of my motivation from the post itself:

I’ve seen more than one project where the results were almost 100% crap because a bioinformatician acted in isolation and didn’t ask for help

Then yesterday I had this conversation on Twitter:

Bioinformatics Unicorns

To summarise, we started with a clinical lab saying, quite rightly, that it is hard to recruit bioinformaticians; there were then many comments about how labs often want to employ people with rare and in-demand skills (so called “bioinformatics unicorns”) on poor salaries or boring projects, and that’s why it is difficult to recruit.

I agree with this, but that’s not the point I want to make here.

Many of you will be ahead of me at this point, but let me spell it out.  Lone bioinformaticians will make mistakes, often elementary mistakes, because they don’t have peer support or access to an expert in bioinformatics who can help them.  This matters less in research labs investigating e.g. the evolution of sea squirts, but clinical labs deal with data that can actually affect a patient’s health and/or treatment.

I am aware of a few lone bioinformaticians working in clinical labs.  I want to make this clear – this is a bad idea.  In fact, it’s a terrible idea.  Through no fault of their own, these guys will make mistakes.  Those mistakes may have dire consequences if the data are then used to inform a treatment plan or diagnosis.

So what’s the solution?  I can think of two:

  1. Pay more and employ more senior bioinformaticians who know what they’re doing, and build a team around those experienced bioinformaticians
  2. Collaborate with a bioinformatics group experienced in human genomics/genetics

To any lone bioinformaticians working in clinical labs, I would say this: find support; find help; make sure the things you are doing are the right things; have your pipelines reviewed by an independent external expert.  Don’t be alone.   This isn’t a personal attack on you – every young scientist makes mistakes, I certainly did – you need support and it’s important you get it.

To the clinical labs: I understand there are funding issues.  This isn’t a personal attack on you either.  But employing a lone (and inexperienced) bioinformatician will almost certainly result in mistakes being made that would have been avoided by someone more experienced.  Please consider options 1 and 2 above.

 

Comment on Piero Anversa controversy

I learned of the issues surrounding Piero Anversa, who has had a paper in Circulation retracted and an Expression of Concern from The Lancet, on Twitter earlier this week.  However, the blog post which I read made me quite uncomfortable, and left me wondering whether we have lost the concept of “Innocent until proven guilty” from science?

Allow me to explain.  The retraction and expression of concern are deeply worrying, and call into question some of the research methods used (by the entire group of scientists, by the way, not just Piero Anversa).  However, the blog post I read goes much further than that, with accusations of fear and threats, of ridicule and of careers ended for asking questions.  These accusations come from an anonymous author, yet we have Piero Anversa’s name and image right there in the post for all to see.

I’m not saying that the post is incorrect, I’m just uncomfortable that we can publish these accusations (and at the moment, that’s all they are) from an anonymous account yet a scientist’s full name and picture are included.  Is that right? (for the sarcastic amongst you, yes I am aware of the internet and what happens on it ;-))

I am not affiliated with Piero Anversa, I have never met him or communicated with him in any way.  I’m not here to defend him.  But I guess that’s my point – no one is here to defend him, whilst unfounded accusations about him are read by thousands.

Some comments on the issues brought up by the post

If we can just set aside the “unfounded accusations” issue for now, the blog post brings up several important issues:

  • I have certainly been involved with projects and scientists where the theory dictates the data, i.e. the theory is stated first and the data are made to fit the theory (Don’t try and figure out who, no one I currently work with does this).  So I am not surprised by this accusation and I would not be surprised if it is true.  It’s bad science.  I have no idea how common it is.  However, my approach has always been to quietly remove myself from the project, and I suggest anyone who is involved in such work, at whatever level you are, do the same.  I realise this may seem like career suicide, but being associated with a compromised paper is also career suicide.  Just get out.
  • The post also mentions “Machiavellian Principles”, and actually I think this is scarily accurate.  I’d say Machiavellian politics are the dominant form within academic scientific research, with a “divide and rule” approach to the competition, and anonymous peer review forming a perfect weapon to “destroy” the opposition.  We should remove this weapon.  However, I see these tactics most often between groups, not within a group.
  • I have never seen the kind of behaviour that the blog post mentions; the naked threats, the fear, the reward of simple obedience.  Maybe I have just been lucky?   Does this actually happen?

Summary

We need to be very wary of making unproven accusations from an anonymous account about named scientists.  This seems very unfair and actually very unscientific.

For me, the most important issue the blog post raises is the point that some scientists put the theory before the data, and make the data fit the theory.  This is clearly wrong and needs to stop.  Whether Piero Anversa is guilty of this, we do not yet know – however, I’d say that some scientists are guilty of it, and that’s what we need to address.

We have the $1000 genome; what’s next?

Well, we got there, didn’t we?  And when I say “we”, I mean “Illumina”.  The $1000 genome is clearly here.  This has been a goal of genomics for so long, that we are left asking – what’s next?  If any of you are thinking “The $100 genome!” then please leave this blog now – you’re not welcome.  Obvious changes like this are intellectually bankrupt, and annoy the hell out of me.

The next step is pretty obvious, and I won’t be the first to say this: “Genome at home”.  That should be the next challenge of genomics, the equivalent of “The $1000 genome”.  And when I say “Genome at home”, I mean everything at home – sequencing and analysis.  What we need is technology that can take a sample from a person sitting in their own home, sequence the genome, and upload the data to software sitting on a laptop that can analyse the data and tell the person what it means.

I can already anticipate the comments/emails from companies telling me they can already do this (at least from a software perspective).  Save your “ink” – you can’t.  Keep trying though.

Some people may point towards the MinION USB sequencer, and I think this is the closest device to being able to generate a “Genome at home”, but there are three barriers still to be overcome: 1) I don’t think the MinION throughput is human-genome ready yet; 2) sample prep still needs to be done, and you need molecular biology skills to do it; 3) we don’t know how good the data are yet.

Of course, as is true of every technology, the “raw data to clinical interpretation” software doesn’t exist yet, though many are trying.

So there we are – the challenge that I think should replace “The $1000 genome” is “Genome at home”.