**Simulating the perfect scenario**

Let’s get our hands dirty in R. I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:

gmat <- data.frame(id=paste("G",1:10,sep=""),
length=as.integer(runif(10, min=200, max=10001)),
nmol=as.integer(runif(10, min=0, max=50001)),
stringsAsFactors=FALSE)

This is what my data look like

> gmat
id length nmol
1 G1 5432 3206
2 G2 7177 43621
3 G3 6189 48670
4 G4 8529 635
5 G5 2576 27919
6 G6 8462 33280
7 G7 4316 18735
8 G8 3600 41457
9 G9 4973 21935
10 G10 9742 26207

What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene. A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.

gmat$sspace <- gmat$length * gmat$nmol
ssum <- sum(gmat$sspace)
gmat$p <- gmat$sspace/ssum

My data now look like this:

> gmat
id length nmol sspace p
1 G1 5432 3206 17414992 0.01098634
2 G2 7177 43621 313067917 0.19750063
3 G3 6189 48670 301218630 0.19002544
4 G4 8529 635 5415915 0.00341666
5 G5 2576 27919 71919344 0.04537072
6 G6 8462 33280 281615360 0.17765861
7 G7 4316 18735 80860260 0.05101114
8 G8 3600 41457 149245200 0.09415216
9 G9 4973 21935 109082755 0.06881546
10 G10 9742 26207 255308594 0.16106284

My p-values should sum to one, and they do:

> sum(gmat$p)
[1] 1

Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix. Therefore, 30million multiplied by my p-values will give me my RNA-Seq counts:

gmat$nreads <- round(30000000 * gmat$p)

And I can now calculate my FPKM values:

gmat$fpkm <- (gmat$nreads / (gmat$length / 1000)) / 30

My data now look like this:

> gmat
id length nmol sspace p nreads fpkm
1 G1 5432 3206 17414992 0.01098634 329590 2022.5209
2 G2 7177 43621 313067917 0.19750063 5925019 27518.5509
3 G3 6189 48670 301218630 0.19002544 5700763 30703.7388
4 G4 8529 635 5415915 0.00341666 102500 400.5941
5 G5 2576 27919 71919344 0.04537072 1361121 17612.8500
6 G6 8462 33280 281615360 0.17765861 5329758 20994.8719
7 G7 4316 18735 80860260 0.05101114 1530334 11819.0767
8 G8 3600 41457 149245200 0.09415216 2824565 26153.3805
9 G9 4973 21935 109082755 0.06881546 2064464 13837.8180
10 G10 9742 26207 255308594 0.16106284 4831885 16532.8309

Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:

We can calculate the relationship between these (forcing the model through the origin):

> lm(nmol ~ 0 + fpkm, data = gmat)
Call:
lm(formula = nmol ~ 0 + fpkm, data = gmat)
Coefficients:
fpkm
1.585

So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:

p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")

(please do not use this formula more generically – it really only works in this simple simulation!)

So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.

(unless I have made any huge errors or incorrect assumptions, which is possible)