Proposal: identifying contaminated cancer cell lines

This morning, Dan Graur tweeted this explosive article:

I recommend everyone reads it.  tl;dr – lots of cancer cell lines are not what they’re supposed to be, having been contaminated and overtaken by other, perhaps more aggressive cell lines.

With the advent of NGS, this seems like something we could tackle relatively easily.  For example, cell lines will have (i) signature gene expression profiles; (ii) signature SNP profiles; and (iii) signature CNV profiles.

It shouldn’t be too difficult to set up a service, linked to the public databases, that can check all submitted data against known (contaminant) cell lines and which could identify datasets that perhaps come from a different cell line to that which is reported.
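To make the SNP-profile idea concrete, here is a minimal sketch of the kind of check such a service might run. Everything in it is an assumption for illustration: the cell line names (`line_A`, `line_B`), the SNP identifiers, the genotypes and the 0.9 threshold are all made up, and a real service would need far more markers and a proper statistical model.

```python
from typing import Dict, Optional, Tuple

# A profile maps a SNP identifier to an unphased genotype call, e.g. "AG".
Profile = Dict[str, str]


def concordance(sample: Profile, reference: Profile) -> Tuple[float, int]:
    """Return (fraction of matching genotypes, number of SNPs compared)."""
    shared = [snp for snp in sample if snp in reference]
    if not shared:
        return 0.0, 0
    # Sort the alleles so that "AG" and "GA" count as the same genotype.
    matches = sum(
        1 for snp in shared if sorted(sample[snp]) == sorted(reference[snp])
    )
    return matches / len(shared), len(shared)


def best_match(
    sample: Profile, panel: Dict[str, Profile], min_snps: int = 3
) -> Tuple[Optional[str], float]:
    """Return the name and concordance of the closest reference profile."""
    best_name, best_score = None, 0.0
    for name, reference in panel.items():
        score, compared = concordance(sample, reference)
        if compared >= min_snps and score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def check_label(
    sample: Profile, reported: str, panel: Dict[str, Profile], threshold: float = 0.9
) -> str:
    """Compare the reported cell line with the best-matching known profile."""
    name, score = best_match(sample, panel)
    if name is None or score < threshold:
        return "no confident match"
    if name == reported:
        return "consistent with reported label"
    return f"possible mislabelling: best match is {name}"


# Toy, invented reference panel and submitted sample.
panel = {
    "line_A": {"rs1": "AA", "rs2": "AG", "rs3": "GG", "rs4": "CT"},
    "line_B": {"rs1": "AG", "rs2": "GG", "rs3": "GG", "rs4": "CC"},
}
sample = {"rs1": "AA", "rs2": "GA", "rs3": "GG", "rs4": "TC"}
print(check_label(sample, "line_B", panel))
```

The same pattern would apply to expression or CNV signatures; only the distance measure would change.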

I propose that the funding agencies immediately fund EBI/NCBI to set up such a service, attached to the major sequence repositories, that can identify possible cell line contamination.

Why anonymous peer review is bad for science

I got into a really long, and often interesting, conversation on Twitter a few days ago about the merits, or not, of open peer review. 140 characters is a bit limiting, so I am putting my arguments here.

My regular readers know that I am a big supporter of open peer review, and I have signed grant and manuscript reviews for about 2 years now – crucially, I sign them whether they are positive or negative.  However, what I really want to do is change the way we see peer review – in my opinion, we should see it as a supportive and collaborative process by which a group of independent scientists assess the quality of a body of research, suggest ways in which it might be improved and decide whether it is ready for publication.  Some of this is encapsulated in my reviewer’s oath.   Peer review doesn’t have to be the awful process that it often is, and it also doesn’t have to be easy on the authors.

Anyway, below are the arguments that I have read about why peer review should remain anonymous, each followed by my response.

“Anonymous peer review protects reviewers”

I am immediately concerned by the language here.  Protect reviewers from what?!  Is peer review suddenly a dangerous activity?  The point is made even more forcefully in this quite terrifying piece, which states:

It might be especially difficult to find referees for authors who hold positions of power and influence, or for those who are considered quarrelsome or vindictive by their peers. In particular, younger, less-established scientists … would be reluctant to reveal themselves, for fear of retaliation from their more powerful colleagues.

OK, OK.  Hands up who in science is happy with the idea that in our field “powerful” and “vindictive” scientists might want to “retaliate” against someone who has reviewed their work?!!  Does anyone seriously think that that’s OK?  Why are we even discussing peer review if we work in a field where this might happen?  Stop the clock.  Time out.  Pause.  Retaliatory, revenge attacks, by anyone, should be considered serious scientific misconduct and the perpetrators should be identified and sacked.  It’s as bad as, or worse than, plagiarism or making up data.  This kind of activity should spell the end of careers.  Are we expected to sit back and accept that this kind of thing might happen?!

Of course, this argument in support of anonymous peer review is actually a very powerful argument against it.  Powerful or vindictive scientists are only able to carry out revenge attacks because they can hide behind a cloak of anonymity.  If their reviews were published, alongside their names, then the community would soon recognise if they were behaving badly, and action would soon follow.  The collective scientific community are actually very ethical and moral, with a strong sense of fair play.  Should a scientist identify retaliatory, revenge reviews, with evidence, we would all listen.  And act.

“Open peer reviews are ‘soft’”

I refer you to this paper, which states:

Asking reviewers to consent to being identified to the author had no important effect on the quality of the review, the recommendation regarding publication, or the time taken to review

So in fact there is no evidence that open peer reviews are in any way “soft”. Hilariously, the Nature Neuroscience piece throws out this peer reviewed, scientific evidence in favour of “an informal poll of some of our referees”.  What?!  You can’t just throw away a paper’s conclusions based on a conversation you may or may not have had with a few carefully chosen referees!

There is no evidence that open peer reviews are soft, and I therefore declare this argument invalid.

“Open peer review promotes ‘quid pro quo’ favourable reviews”

Yes, in fact I think open peer review probably does enable this kind of unethical, fraudulent behaviour;  my simple point is that anonymous peer review doesn’t stop it either!  They’re as bad as each other.  There is absolutely nothing to stop an anonymous reviewer emailing the authors of the study, praising their amazing research, pointing out they were reviewer 2 and inviting them to review their future papers.

Open peer review does not solve the “favourable reviewer ring” problem; but nor does anonymous peer review!

“Anonymity removes egos from the equation”

This one actually makes me laugh out loud!  Sure, because as anyone who has ever been involved with anonymous peer review knows, egos are completely absent…..

“Open review will make it hard to find reviewers”

Yes, I think this is probably true and is backed up by the BMJ paper above.  I have one simple riposte to this.  When you look at a body of work, with the aim of reviewing it, the only thing you should be thinking about is the quality of the science (and perhaps also the impact, if the editor wants you to consider that).  If you would be scared to reveal your name to the authors, then you are already showing evidence that you’re unable to give an objective review.  You are already thinking, consciously or subconsciously, about who the authors are, what they represent to you, how they might affect you and your work.  You cannot be objective and you shouldn’t review the paper, anonymously or otherwise.  No review is better than a biased review.

“Anonymity corrects for unequal power relationships”

No, anonymity enables unequal power relationships, and open data, open information, open peer review equalizes those power relationships and ensures all interactions can be assessed independently for fair play.

One of the most telling quotes from the Nature Neuroscience piece is this one:

The journal also experienced an occasional breakdown of the peer-review process, in which authors and referees bypassed the editors completely in negotiating how a paper should be revised.

OH MY GOD!  Stop the planet!  Everyone just HOLD!  This is serious!  Imagine, authors and reviewers collaborating together to make a piece of work better, outside of the control of a journal!  THIS MUST BE STOPPED!

(seriously, you can’t make this up!)

In case you hadn’t noticed, the BMJ has practised open peer review since 1999, and they took this decision on the basis that it was simply more ethical, without cutting down on quality. Good on them.  They seem to be doing OK for themselves! If you don’t have time to read it, here are some choice quotes:

All editors have seen curt, abusive, destructive reviews and assumed that the reviewer would not have written in that way if he or she were identifiable

openness should eliminate some of the worst abuses of peer review, where reviewers—under the cloak of anonymity—steal ideas or procrastinate

Openness … links accountability with credit

Science is progressively moving away from anonymity.

we hope our small move will contribute to a broader culture change so that junior researchers cease to fear reprisals from senior ones

Inspirational stuff, I am sure you’ll agree.

What else can I add to this?  Just my own experiences.  Putting my name next to reviews has made me, without exception, give more thought to what I write, and that can only be a good thing.  I think more about the evidence for my criticism and whether those criticisms are valid.  More importantly, I empathise with the authors.  This is a good thing.  It doesn’t make me soft, it just helps me understand.  By far and away the biggest reaction I have seen is one, not of anger, but of respect.

Detractors will say that perhaps I have had a negative reaction, I just haven’t seen it; and perhaps they are right.  In fact, the only way I would see is if those reactions were open, and signed with the names of the reviewers responsible.  Which makes my point very nicely.

Thoughts on Oxford Nanopore’s MinION mobile DNA sequencer

Quite a few others have had their say on Oxford Nanopore’s MinION sequencer, and so I thought I would write down a few thoughts of my own.  At Edinburgh Genomics, we’ve been working with the MinION since the first day of the MAP, and earlier this month we published one of the first bioinformatics tools to help users work with this platform: poRe.  Whilst there have been low points in the process (mostly involving reagents or flowcells that don’t meet our and ONT’s high standards), we are still incredibly positive about the possibilities this platform offers.

Personally, I am still convinced that the future of DNA sequencing is nanopore based, but I’m getting ahead of myself.  I want to make a few points.

Some perspective (part I)

Illumina are a $7 billion company who currently dominate the sequencing market, and over 90% of DNA sequencing data produced today comes from one of their platforms;  Thermofisher, who own LifeTech, who own Ion Torrent, employ 50,000 people and had revenues of $13.1 billion in 2013.  Even Pacific Biosciences, a minnow compared to the previous two, can be considered established, having launched their first sequencer in 2010 and with 100s of publications resulting from their technology.

Now consider Oxford Nanopore.  A company that employs around 180 people, based in a science park on the outskirts of Oxford.  A company that didn’t have a product up until 2014, when they launched the MinION to lucky collaborators.  A small company with an exciting technology in a competitive market.

Firstly, if you expect ONT to behave like the other three companies I mentioned, then you must be insane.

Secondly, ONT have a disruptive technology that could, potentially, dominate the sequencing market for years to come.  This puts ONT in a position of strength; but it also makes them vulnerable to attack.  They must be very careful about how they operate,  and they need to trust the people that they work with.  I don’t understand what some people don’t get about that.

Some perspective (part II)

If you are currently sat in front of a bunch of MinION data wondering how on earth you’re ever going to make it do what you can do with your Illumina data, then go into your lab, take the MinION, put it back in the box and send it back to ONT.  Seriously.  Wondering how you can fit MinION into your existing workflows is like sitting in front of a space ship and wondering how you’re going to use it to commute to work; it’s like sitting in front of a time machine, and wondering if you could use it to get to the shops before they close.  The potential applications of a mobile DNA sequencer are incredible.

You have in your possession, if you are lucky, the world’s first mobile DNA sequencer.  It’s 4″ long.  The fact it can produce data of any kind from DNA is a miracle.  You are in possession of a miracle.  If your mind isn’t ablaze with amazing applications, then give up on science and go home.  Seriously.

That paper

I have nothing against Alexander Mikheyev or Mandy Tin, I don’t know them nor have I ever met them.  I wish them every success in their future, I genuinely mean that.  Alexander, Mandy, if you are reading this, I mean you no harm.  But that paper is terrible.  It’s just lazy.  You ran the MinION a few times, got poor data, and… that’s it.  That’s not something that should be written up into a paper.  Seriously, what a waste of everyone’s time, yours and mine.  If this is science, then I am depressed for science.  Alexander, Mandy – I am 100% positive that you are better than this.

Some fuss has been made about the authors being kicked out of the MAP.  Let me tell you about the MAP.  The MAP is an amazing thing.  What most sequencing companies do with a new tech, is they take it to the big genome centres (Sanger, Broad, WashU, BGI etc), they get down on their knees and they say “please make it work; please adopt it and say nice things about it”.  This is one of the reasons why most new bioinformatics tech comes out of genome centres – they always get first glance at new sequencing technology, and so they’re in pole position when it comes to writing new algorithms to deal with it.  MAP is different.  I’m not saying they haven’t put MinION into Sanger, Broad etc – but they’ve also given it to lone scientists, to small groups, to medics, to public health, to vets.  ONT have placed around 500 MinIONs in over 20 countries.

The MAP is about collaboration; it is about trust.

At no point have I doubted that ONT would let us publish data.  At no point have I felt controlled, or restricted in what we could do.  ONT simply want to know what MinION data look like in others’ hands, both experts and novices.  In return for the platform, they want to see the data and understand why it looks the way it does.  It’s a collaboration.  It’s about give and take.

The “hit and run” paper of Mikheyev and Tin just doesn’t fit into that framework.  Not in any way.  Technically speaking, “self certification” is a statement to ONT that the platform is delivering data that are “good enough for the applications I want to use it for”.  Mikheyev and Tin self-certified.  Where are the applications then guys?

Don’t you think it was a bit dishonest to self-certify and then publish that paper?

Just to be clear – only two groups have left the MAP; one group who left voluntarily as they didn’t have enough time to devote to MinION, and Mikheyev and Tin.  On the flip side, many additional people have been admitted.

Working with the data

Some very talented people in my lab have been working with the data, and I have been remarkably hands off.  However, here are some of my opinions about the data:

  • I don’t get the impression that generating ultra-long reads is going to be a problem.  What goes through the pore is what’s in the sample, and if you have long fragments, you get long reads
  • If you take the reads, align against a reference, you cover the whole genome, and can call the reference with 100% accuracy.
  • You can call SNPs.
  • Larger genome scaffolding is possible, and results are similar to scaffolding with PacBio.
  • (yes I realise these are the boring questions I referred to above; the answer is “yes you can”; now it’s time to go dream of amazing applications)
  • The quality needs to improve, base-calling needs to improve, throughput needs to improve – and I believe all of them will
  • Error correction strategies look very promising and I think you will see papers on this in the next few months
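If it seems odd that noisy reads can still recover the reference exactly, a toy example helps: with enough coverage, random errors in individual reads get outvoted at each position. The sketch below is only an illustration of that principle, assuming the reads have already been aligned and padded to the same length (with "-" for gaps); it is not the method we used, and real consensus callers are far more sophisticated.

```python
from collections import Counter
from typing import List


def majority_consensus(aligned_reads: List[str]) -> str:
    """Column-wise majority vote over equal-length, pre-aligned reads.

    "-" marks a gap and is ignored; a column with no bases becomes "N".
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


# Invented toy data: several reads each carry one error,
# yet the vote recovers the reference.
reference = "ACGTAC"
reads = ["ACGTAC", "ACCTAC", "ACGTTC", "A-GTAC", "TCGTAC"]
print(majority_consensus(reads) == reference)
```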

A final word

I have respect for all sequencing companies and I believe all of them have something to offer. Illumina are amazing – talk about a company that delivers!  They are about to enable routine medical genomics, and sequence entire countries.  What they’ve done with their technology is incredible.  Believe me, my pom poms still get used!

As I have said many times, I have huge respect for the way PacBio have turned themselves around.  They were going nowhere just a few years ago, and now they are essential to most new genome sequencing efforts.  Credit where it is due, though I do wonder if they have anything left in the tank.

Now there is ONT – a tantalising technology, and a company whose place in history is assured, having produced the first mobile DNA sequencer.

Working with companies is a skill, I get that; a skill that many academics clearly lack.  ONT have a great technology, and as a result, they are vulnerable to attack.  As their collaborators, we have to recognise that and work with them, not against them.  For example, it may be coincidence, but take a look at PacBio’s share price on 3rd September, the day Nick Loman gave a talk about the great things he is doing with ONT data.  If those two events are related, then PacBio will be worried.  Does anyone think they will just sit back and let ONT take their long-read crown?

The MinION is incredible.  Nanopore sequencing is here and it’s here to stay, in my opinion.  Don’t get me wrong, there’s a long way to go; lots of improvements need to happen.  And they will.  But the people that will make those improvements, both in the technology and in the bioinformatics algorithms to deal with the data, will be positive, forward-thinking people, people who approach science and data with optimism and an open mind.

Let’s be those people.

poRe: an R package for the visualization and analysis of nanopore sequencing data

I wanted to tell the story of our new software, poRe, which is an R package written to help users deal with Oxford Nanopore’s MinION sequencer.  We recently put the pre-print on bioRxiv, and it is getting plenty of attention.  So let’s get into it!

The MinION device

Whichever way you look at it, the MinION is a revolutionary device.  About 6 inches long, it plugs into the USB port of a Windows laptop (which is mandatory, to enable the proper running of the device). As you will know, the MinION measures single strands of DNA as they pass through protein nanopores, and is capable of ultra-long reads – reads of over 100Kb have been reported.

It is very easy to imagine applications of this device which involve mobile sequencing – for example, picture a vet sat in a barn next to a sick animal, taking blood and sequencing DNA in real time to discover which pathogen is causing the illness.

However, as early access users of the MinION, we quickly realised that we would need to develop some software to begin to approach this vision.  The MinION is clearly aimed at non-experts, biologists who want to use it for rapid sequencing “in the field”.  To enable this, we need to give them software they can use, and that software needs to work on the Windows laptop itself.  Relying on uploading the data to a server and analysing it there is not a solution that can work for “in the field” sequencing (people have said to me that you can access the internet anywhere now;  they are often city folk, who don’t realise that 3G/4G coverage doesn’t extend to huge swathes of the countryside, never mind 3rd world countries who don’t have the infrastructure).

The MinION workflow

Once a sample has been applied to the MinION, and DNA molecules pass through nanopores on the flowcell, data collection begins.  If a channel passes a number of QC metrics, each sequence read is written to a file.  Due to a hairpin adapter, each DNA molecule can be read twice, termed “template” and “complement” reads.  Whether the molecule is read once or twice, the raw data are written to a file on the laptop’s hard disk drive.

The directory holding these files is monitored by an agent, and once new files are discovered, they are queued for upload to metrichor, a cloud-based base-caller which takes the raw data and calls the nucleotide sequence.  The base-called files are downloaded and stored in a sub-folder.

The organisation challenge

Here the challenge begins.  All files produced by the MinION are in HDF5 format, a binary hierarchical data format.  These files (called .fast5 files) require specialised software to open.  In addition, the MinION dumps all data files from multiple runs into a single directory, embedding run information within each HDF5 file.  Each run can produce 10,000-30,000 files, and therefore very quickly users are presented with a directory with 100,000s of files in it and no way of organising them into run folders.  Finally, there have been and continue to be multiple versions of metrichor (the base-caller). Therefore each data file can be base-called multiple times.  This is quite a large and complex data set.
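To give a feel for the organisation problem, here is a rough sketch of sorting one flat directory into run folders, with the base-caller version as a sub-folder. It is an illustration only and not poRe's code: `organise_runs` and the `read_metadata` callback are hypothetical names, and the callback stands in for whatever actually reads the run identifier and base-caller version out of each HDF5 file.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Tuple


def organise_runs(
    data_dir: Path,
    read_metadata: Callable[[Path], Tuple[str, str]],
) -> Dict[str, Path]:
    """Move each .fast5 file into data_dir/<run_id>/<basecaller_version>/.

    read_metadata is a hypothetical callback returning
    (run_id, basecaller_version) for a file; reading those values
    from the HDF5 content is deliberately left out of this sketch.
    """
    moved = {}
    # sorted() materialises the listing before any file is moved.
    for fast5 in sorted(data_dir.glob("*.fast5")):
        run_id, version = read_metadata(fast5)
        target_dir = data_dir / run_id / version
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / fast5.name
        shutil.move(str(fast5), str(target))
        moved[fast5.name] = target
    return moved
```

Called as `organise_runs(Path("some_data_dir"), my_reader)`, this leaves non-.fast5 files where they are and returns a mapping from each file name to its new location.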

The data extraction challenge

Lots of information is embedded within each .fast5 file, including run metrics and the nucleotide sequences that most people are interested in, and these data can only be extracted programmatically using the HDF5 library.  This represents a challenge for bioinformaticians, never mind biologists (or vets!).


As a package for R, poRe runs on Windows and has an incredibly simple installation path (just two additional libraries need to be installed manually).  poRe enables users to organise the MinION directory into run folders, and use the version of the metrichor base-caller as a sub-folder.  poRe also allows extraction of the sequence data as fastq, and collects the run statistics into an easy-to-use data.frame within R.  A number of plots are built in, and of course users can plot their own graphs if they wish.  poRe has also been tested on Linux, and at least one user is using it on Mac.
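For readers who have not met FASTQ before, the two outputs described above (sequences as fastq, plus a table of run statistics) boil down to something like the sketch below. This is a language-neutral illustration written in Python rather than R, it is not taken from poRe, and the `Read` structure, the function names and the example reads are all made up.

```python
from statistics import mean
from typing import Dict, List, NamedTuple


class Read(NamedTuple):
    """A base-called read (hypothetical structure for this sketch)."""
    name: str
    sequence: str
    quality: str  # one ASCII-encoded quality character per base


def to_fastq(read: Read) -> str:
    """Format one read as a four-line FASTQ record."""
    if len(read.sequence) != len(read.quality):
        raise ValueError("sequence and quality must have the same length")
    return f"@{read.name}\n{read.sequence}\n+\n{read.quality}\n"


def summarise(reads: List[Read]) -> Dict[str, float]:
    """Collect simple per-run statistics from a list of reads."""
    lengths = [len(read.sequence) for read in reads]
    return {
        "n_reads": len(lengths),
        "total_bases": sum(lengths),
        "mean_length": mean(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }


# Invented example reads.
reads = [Read("read_1", "ACGT", "!!!!"), Read("read_2", "ACGTAC", "######")]
print("".join(to_fastq(read) for read in reads), end="")
print(summarise(reads))
```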

Check it out!  Tutorial here; software here.


poRe screenshots

Citations are not a measure of quality

Edited 30/07/2014 – I got confused about who wrote what and when – apologies to Embriette and Jonathan

There’s a lot of discussion about Neil Hall’s paper on the Kardashian index, and this is classic Neil stuff – funny and provocative.  I’m not going to discuss this paper directly; enough people are doing that already.

Over on the microBEnet blog, Embriette Hyde states:

A high K index indicates that said scientist may have built their reputation on a shaky foundation (i.e. the Kardashians, circled in Figure from the paper below), while a low K index indicates that said scientist is not being given credit where credit is due.

I think we’re coming dangerously close to making judgements about quality, and what we’re actually measuring is citations – and they’re not the same thing!

For me, the number of citations a particular paper gets is based on:

  1. The size of the field e.g. if you work in cancer (a large field), your paper has a higher probability of being cited than if you work on an obscure fungus that only grows in Lithuanian coal mines (probably a very small field, but I may be wrong).
  2. Given the size of the field, next comes the importance of your work.  So, sticking with the cancer example, you may work with a high prevalence cancer, such as lung cancer (70,000 deaths last year in the UK), or a rare one, such as cancer of the penis (111 deaths in the UK last year).
  3. Then, given the size of the field, and the importance of the work, comes relevance.  Perhaps you discovered an oncogene that affects 60% of the population; alternatively you have found an oncogene that affects 2% of a small tribe from the Peruvian rainforest.
  4. Finally comes the quality of your work.  It is quite easy to imagine someone who does incredibly high quality work in a niche area getting far fewer citations than someone who does really crappy work in a larger, more important field

To be fair to Neil, and Embriette, neither directly states that citations = quality; in fact Neil deliberately uses the phrase “scientific value”, which is ill defined.

However, the fact still remains that what Neil is implying is that scientists with low numbers of citations are somehow less than those with lots of citations.  And that’s just not right.

And Embriette’s wording implies that scientists with low citations (and high numbers of Twitter followers) are on “shaky foundations”.  I say the assumptions behind that wording are on shaky foundations!

Some really very high quality work never gets cited (we can argue another time about the value of such research).  And some really quite shoddy work gets huge numbers of citations.  So I get to have the last word:  number of citations is not a measure of quality.


You’re not allowed bioinformatics anymore

“Ah welcome! Come in, come in!” said the institute director as Professor Smith appeared for their scheduled 2pm meeting. “I want to talk to you about your latest proposal”, the director continued.

“Oh?” replied Smith.

“Yes. Now, let’s see. It’s an amazing, visionary proposal, a great collaboration, and congratulations on pulling it together. I just have one question” said the director “This proposal will generate a huge amount of data – how do you plan to deal with it all?”

“Oh that’s easy!” answered Smith. “It’s all on page 6. We’ve requested funds to employ a bioinformatician for the lifetime of the project. They’ll deal with all of the data” he stated, triumphantly.

The director frowned.

“I see. Do you yourself have any experience of bioinformatics?”

Smith seemed uncertain.

“Well, no…..”

“Then how will you be able to guide the bioinformatician, to ensure they are using appropriate tools? How will you train them?” the director pressed

Smith appeared perplexed by the question.

“We’ll employ someone who has already been trained, with at least a Masters in bioinformatics! They should already know what they’re doing…” Smith trailed off.

The director sighed.

“And what papers will the bioinformatician publish?”

Smith regained some confidence.

“They’ll get co-authorship on all of the papers coming out of the project. The post-docs who do the work will be first author, I will be last author and the bioinformatician will be in the middle”

The director drummed his fingers on his desk.

“What about a data management plan?”

“A what?”

“A data management plan. A plan, to manage the data. Where will it be stored? How will it be backed up? When will it be released?” the director asked

“Same as always, I guess” said Smith. “We’ll release supporting data as supplementary PDFs, and we’ll make sure we get every last publication we possibly can before releasing the full data set”

The director shifted uneasily in his seat. “And data storage?”

“Don’t IT deal with that kind of stuff?” Smith answered.

An awkward silence settled over the office. The director stared at Professor Smith. Finally he broke the silence.

“OK, so you have this bioinformatician, you give them the data, and they analyse it and they give you the results. How will you ensure that they’ve carried out reproducible science?”

“Reproducible what? What the hell are you talking about?” Smith answered angrily.

The director slammed his hand down on the desk.

“At least tell me you have a plan for dealing with the sequence data!”

“Of course!” said Smith “We’ve been doing this for years. We’ll keep the sequences in Word documents….”

an amber light started flashing on the director’s desk

“… annotate genes by highlighting the sequence in blue…”

the flashing light turned red

“… annotate promoters by highlighting the sequence in orange…”

Smith’s sentence was interrupted by a noisy klaxon suddenly going off, accompanied by a bright blue flashing light that had popped up behind the director’s chair.  Smith looked wide-eyed, terrified.

The director pressed a few buttons on his desk and the noisy alarm ceased, the blue light disappeared.

Smith, removing his hands from his ears, asked “What the hell was that?”

The director stood, walked over to the window and sighed heavily. “I’m sorry, Smith. I had a feeling this might happen. Look… this may appear harsh, but… you’re not allowed bioinformatics anymore”


“As I said. You’ve crossed the threshold. You’re not allowed bioinformatics anymore”

Smith’s mouth flapped open and shut as he tried to take in the news.

“You mean no-one will analyse my data?”

The director turned to face Smith.

“Quite the contrary, Smith. Good data will always be welcome, and yours will be treated no differently. It’s just that you won’t be in charge of the storage and analysis of it anymore. You can generate the data, but that will be the end of your involvement. The data will be passed to a bioinformatics group who know what to do with it.”

Smith was furious.

“Are you insane? That’s my data! I can do whatever I like with it! Bioinformaticians won’t know what to do with it anyway!”

“On the contrary” replied the director “It’s not your data. Your research is funded by the government, which is in turn funded by the tax payer. The data belong in the public domain. As for bioinformaticians, they’re scientists too and they’ll be able to analyse your data just as well as you can, probably better”

“I’ve never heard anything so ridiculous! Who decided that I’m not allowed bioinformatics anymore?”

“The Universe.”

“The Universe? Why should the Universe say I’m not allowed bioinformatics anymore?”

“Because you haven’t paid bioinformatics enough attention. It’s not a support service, at your beck and call. It’s a science. Bioinformaticians are scientists too. Young bioinformaticians need support, guidance and training; something you’re clearly not qualified to provide. They also need first-author papers to advance their careers”

“I don’t understand. What do you mean, they’re not support?!” spluttered Smith.

The director continued regardless of the interruption.

“You’ve had the opportunity to learn about bioinformatics. We’ve had a bioinformatics research group at the institute for over ten years, yet you only ever speak to them at the end of a project when you’ve already generated the data and need their help!”

“The bioinformatics group?! They’re just a bunch of computer junkies!”

The director was beginning to get angry.

“Quite the opposite. They publish multiple research papers every year, and consistently bring in funding. More than your group, actually”.

Smith looked stunned.

“But, but, but… how can this be possible? You’ll never get away with this!”

“I’m afraid I can and I will” said the director. “Science has changed, Smith. It’s a brave new world out there. Bioinformatics is key to the success of many major research programmes and bioinformaticians are now driving those programmes. Those researchers who embrace bioinformatics as a new and exciting science will be successful and those that don’t will be left behind.”

The director stared pointedly at Professor Smith. Smith was defeated, but still defiant.

“It doesn’t matter. We have tons of data we haven’t published yet. I’ll be able to work on that for decades! I don’t need new data, I have plenty of existing data”.

A smile flittered at the corners of the director’s mouth.

“Here’s the thing, Smith. As soon as that alarm went off, all of your data were zipped into a .tar.gz archive and uploaded to the cloud. It’s no longer in your possession”.

Smith looked horrified.

“What’s the cloud? How do I access it? What is a .tar.gz file and how do I open it?”

“You know” said the director “keep asking questions like that, and you might get bioinformatics back”

If you are leading a project that creates huge amounts of data, instead of employing a bioinformatician in your own group, why not collaborate with an existing bioinformatics group and fund a post there? The bioinformatician will benefit hugely from being around more knowledgeable computational biologists, and will still be dedicated to your project.

The above was hugely inspired by “Ballantyne T (2012) If only … Nature 489(7414):170-170”.  I hope Tony doesn’t mind.