-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy path09-applying-rnaseq-solutions.Rmd
394 lines (292 loc) · 10.2 KB
/
09-applying-rnaseq-solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
---
layout: page
title: "RNA-seq analysis in R"
subtitle: "Solutions: Analysis of Pasilla Knock-down Experiment in Drosophila"
author: "Belinda Phipson and Jovana Maksimovic"
date: "21, 23 September 2016"
minutes: 15
---
**Author: Belinda Phipson and Jovana Maksimovic**
```{r, include=FALSE}
source("tools/chunk-options.R")
opts_chunk$set(fig.path = "fig-09/")
```
## Data files needed
* counts_Drosophila.txt
* SampleInfo_Drosophila.txt
Available from [https://figshare.com/s/e08e71c42f118dbe8be6](https://figshare.com/s/e08e71c42f118dbe8be6).
## Libraries needed
* limma
* edgeR
* org.Dm.eg.db
* gplots
* RColorBrewer
## Introduction
The RNA-Seq data we will be analysing today come from this published paper:
Brooks, A.N., Yang, L., Duff, M.O., Hansen, K.D., Park, J.W., Dudoit, S.,
Brenner, S.E. and Graveley, B.R. (2011) Conservation of an rna regulatory
map between drosophila and mammals. Genome Research, 21(2), 193-202.
http://www.ncbi.nlm.nih.gov/pubmed/20921232
This is a publicly available dataset, deposited in the
Short Read Archive. The RNA-sequence data are available from GEO under accession nos. GSM461176-GSM461181. The authors combined RNAi and RNASeq
to identify exons regulated by Pasilla, the Drosophila
melanogaster ortholog of mammalian NOVA1 and NOVA2.
They showed that the RNA regulatory map of Pasilla and
NOVA1/2 is highly conserved between insects and mammals.
NOVA1 and NOVA2 are best known for being involved
in alternative splicing. Cells from S2-DRSC, which is an
embryonic cell line, were cultured and subjected to a treatment
in order to knock-down Pasilla. The four untreated and
three treated RNAi samples were used in the
analysis. The treated samples had Pasilla knocked
down by approximately 60% compared to the untreated
samples. Some of the samples had undergone paired
end sequencing while other samples were sequenced from one end only.
The reads were aligned to the Drosophila reference genome,
downloaded from Ensembl, using the tophat aligner.
The reads were summarised at the gene-level using
htseq-count, a function from the tool HTSeq
(http://wwwhuber.embl.de/users/anders/HTSeq/doc/overview.html).
For the purpose of today's workshop, we will be analysing the gene level counts.
## Reading data into R
First, let's load all the libraries we will need today.
```{r, message = FALSE}
library(limma)
library(edgeR)
library(gplots)
library(RColorBrewer)
library(org.Dm.eg.db)
```
Next, read in the data and targets file:
```{r}
counts <- read.delim(file="data/counts_Drosophila.txt")
targets <- read.delim(file="data/SampleInfo_Drosophila.txt")
```
Check that the data has read in correctly
```{r}
head(counts)
targets
```
## Filtering out lowly expressed genes
Our main interest is in testing the treated versus untreated groups. To check how many samples we have in each group we can use the `table` command.
```{r}
table(targets$Group)
```
The minimum sample size is 3. Let's check the relationship between CPM and counts to see what CPM threshold we should be imposing. Recall we're looking for a CPM that corresponds to a count of roughly 10-15.
```{r}
mycpm <- cpm(counts)
plot(counts[,1],mycpm[,1],xlim=c(0,20),ylim=c(0,5))
abline(v=10,col=2)
abline(h=2,col=4)
```
We can filter on a CPM of 2 or 3 in at least 3 samples.
```{r}
thresh <- mycpm > 2
keep <- rowSums(thresh) >= 3
table(keep)
counts.keep <- counts[keep,]
dim(counts.keep)
```
We are filtering out about half our genes.
## Convert to DGEList object
```{r}
y <- DGEList(counts.keep)
```
## Quality control
Let's do a number of quality control plots.
First, check the library sizes:
```{r}
barplot(y$samples$lib.size)
```
Next check the distribution of the counts using a boxplot:
```{r}
par(mfrow=c(1,1))
# Get log2 counts per million
logcpm <- cpm(y$counts,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcpm, xlab="", ylab="Log2 counts per million",las=2,outline=FALSE)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcpm),col="blue")
title("Boxplots of logCPMs (unnormalised)")
```
We can colour by our groups, or by the different library prep.
```{r}
par(mfrow=c(1,2),oma=c(2,0,0,0))
group.col <- c("red","blue")[targets$Group]
boxplot(logcpm, xlab="", ylab="Log2 counts per million",las=2,col=group.col,
pars=list(cex.lab=0.8,cex.axis=0.8))
abline(h=median(logcpm),col="blue")
title("Boxplots of logCPMs\n(coloured by groups)",cex.main=0.8)
lib.col <- c("light pink","light green")[targets$Library]
boxplot(logcpm, xlab="", ylab="Log2 counts per million",las=2, col=lib.col,
pars=list(cex.lab=0.8,cex.axis=0.8))
abline(h=median(logcpm),col="blue")
title("Boxplots of logCPMs\n(coloured by library prep)",cex.main=0.8)
```
It doesn't look like there are any large biases in the data.
Finally, let's check our MDS plots.
```{r}
par(mfrow=c(1,1))
plotMDS(y)
```
```{r}
par(mfrow=c(1,2))
plotMDS(y,col=group.col)
legend("topright",legend=levels(targets$Group),fill=c("red","blue"))
plotMDS(y,col=lib.col)
legend("topleft",legend=levels(targets$Library),fill=c("light pink","light green"))
```
It looks like there is some variability in the data due to library type e.g. single end or paired end.
## Hierarchical clustering with heatmap.2
First we need a matrix of log counts:
```{r}
logcounts <- cpm(y,log=TRUE)
```
Get variances for genes:
```{r}
var_genes <- apply(logcounts, 1, var)
```
Get top 500 most variable
```{r}
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
```
```{r}
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)
```
```{r}
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Plot the heatmap
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",ColSideColors=group.col,scale="row",margins=c(10,5))
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",ColSideColors=lib.col,scale="row",margins=c(10,5))
```
## Normalisation
Let's do TMM normalisation
```{r}
y <- calcNormFactors(y)
y$samples
```
```{r}
par(mfrow=c(1,2))
plotMD(logcounts,column=2)
abline(h=0,col="grey")
plotMD(y,column = 2)
abline(h=0,col="grey")
```
## Differential expression
### Set up design matrix
We want to test for differences between the treated and untreated samples. However, we know that the library preparation adds variability to the data, so we need to account for it in our model. We do this by modelling both Group and Library as variables in our design matrix. This is known as an additive model.
```{r}
design <- model.matrix(~targets$Library + targets$Group)
design
colnames(design) <- c("Int","SEvsPE","UVsT")
```
### Voom transform the data
```{r}
par(mfrow=c(1,1))
v <- voom(y,design,plot=TRUE)
```
```{r}
par(mfrow=c(1,2))
boxplot(logcounts)
abline(h=median(logcounts),col=4)
boxplot(v$E)
abline(h=median(v$E),col=4)
```
### Test for differential expression
```{r}
fit <- lmFit(v,design)
fit <- eBayes(fit)
results <- decideTests(fit)
summary(results)
```
```{r}
topTable(fit,coef=3,sort.by="p")
```
### Add annotation from org.Dm.eg.db
The rownames of the fit object are [http://flybase.org/](FlyBase) ids.
First we need to decide what information we want. In order to see what we can extract we can run the `columns` function on the annotation database.
```{r}
columns(org.Dm.eg.db)
```
We definitely want gene symbols and perhaps the full gene name and Entrez id. Let's build up our annotation information in a separate data frame using the `select` function. Note, by default, the `select` function assumes that the keys provided are Entrez ids. However, in this case we are using FlyBase ids as the keys. As such, we need to give the `select` function this information using the `keytype` argument which we will set to `"FLYBASE"`.
```{r}
ann <- select(org.Dm.eg.db,keys=rownames(fit),columns=c("FLYBASE","ENTREZID","SYMBOL","GENENAME"),keytype="FLYBASE")
# Have a look at the annotation
head(ann)
```
Let's double check that the `FLYBASE` column matches exactly to our `fit` rownames.
```{r}
table(ann$FLYBASE==rownames(fit))
```
We can slot in the annotation information into the `genes` slot of `fit`. (Please note that if the `select` function returns a 1:many mapping then you can't just append the annotation to the fit object. An alternative way to get annotation is shown below.)
```{r}
fit$genes <- ann
```
Now when we run the `topTable` command, the annotation information should be included in the output.
```{r}
topTable(fit,coef=3,sort.by="p")
```
If for some reason the `select` function doesn't work, or the result is a 1:many mapping, there is a less elegant way to get the annotations using the `toTable` command.
```{r}
# Let's see what we can get in table format from org.Dm.eg.db
ls("package:org.Dm.eg.db")
# Get annotation
fly <- toTable(org.Dm.egFLYBASE)
head(fly)
symbol <- toTable(org.Dm.egSYMBOL)
genename <- toTable(org.Dm.egGENENAME)
# We can use the merge command to merge two dataframes
ann1 <- merge(fly,symbol,by="gene_id")
head(ann1)
# Add genename table to ann1
ann2 <- merge(ann1,genename,by="gene_id")
head(ann2)
```
Now we need to match up between `rownames(fit)` and the ensemble gene id in `ann2`.
```{r}
m <- match(rownames(fit),ann2$flybase_id)
table(is.na(m)) # check for unmatched rows
ann3 <- ann2[m[!is.na(m)],] # exclude unmatched rows
# compare the results in ann3 to what is in the fit object
head(ann3)
head(fit$genes)
```
```{r}
topTable(fit,coef=3,sort.by="p")
```
Let's check the expression of pasilla.
```{r}
ps <- grep("pasilla",fit$genes$GENENAME)
topTable(fit[ps,],coef=3)
```
## Plots after testing for DE
```{r}
par(mfrow=c(1,2))
plotMD(fit,coef=3,status=results[,"UVsT"])
volcanoplot(fit,coef=3,highlight=100,names=fit$genes$SYMBOL)
```
```{r}
stripchart(v$E["FBgn0025111",]~targets$Group)
```
```{r}
# Check expression of Pasilla
stripchart(v$E["FBgn0261552",]~targets$Group)
```
## Testing relative to a threshold
```{r}
fit.treat <- treat(fit,lfc=1)
res.treat <- decideTests(fit.treat)
summary(res.treat)
topTreat(fit.treat,coef=3)
```
```{r}
plotMD(fit.treat,coef=3,status=res.treat[,"UVsT"])
abline(h=0,col="grey")
```
## Gene set testing with goana
```{r}
go <- goana(fit, coef="UVsT",species = "Dm", geneid="ENTREZID")
topGO(go, n=10)
```