09-applying-rnaseq-solutions.Rmd

---
layout: page
title: "RNA-seq analysis in R"
subtitle: "Solutions: Analysis of Pasilla Knock-down Experiment in Drosophila"
author: "Belinda Phipson and Jovana Maksimovic"
date: "21, 23 September 2016"
minutes: 15
---

**Author: Belinda Phipson and Jovana Maksimovic**

```{r, include=FALSE}
source("tools/chunk-options.R")
opts_chunk$set(fig.path = "fig-09/")

```

## Data files needed

* counts_Drosophila.txt
* SampleInfo_Drosophila.txt

Available from [https://figshare.com/s/e08e71c42f118dbe8be6](https://figshare.com/s/e08e71c42f118dbe8be6).

## Libraries needed

* limma
* edgeR
* org.Dm.eg.db
* gplots
* RColorBrewer

## Introduction

The RNA-Seq data we will be analysing today come from this published paper:

Brooks, A.N., Yang, L., Duff, M.O., Hansen, K.D., Park, J.W., Dudoit, S.,
Brenner, S.E. and Graveley, B.R. (2011) Conservation of an rna regulatory
map between drosophila and mammals. Genome Research, 21(2), 193-202.  
http://www.ncbi.nlm.nih.gov/pubmed/20921232

This is a publicly available dataset, deposited in the
Short Read Archive.  The RNA-sequence data are available from GEO under accession nos. GSM461176-GSM461181.  The authors combined RNAi and RNASeq
to identify exons regulated by Pasilla, the Drosophila
melanogaster ortholog of mammalian NOVA1 and NOVA2.
They showed that the RNA regulatory map of Pasilla and
NOVA1/2 is highly conserved between insects and mammals.
NOVA1 and NOVA2 are best known for being involved
in alternative splicing. Cells from S2-DRSC, which is an
embryonic cell line, were cultured and subjected to a treatment
in order to knock-down Pasilla. The four untreated and
three treated RNAi samples were used in the
analysis. The treated samples had Pasilla knocked
down by approximately 60% compared to the untreated
samples. Some of the samples had undergone paired
end sequencing while other samples were sequenced from one end only.

The reads were aligned to the Drosophila reference genome,
downloaded from Ensembl, using the tophat aligner.
The reads were summarised at the gene-level using
htseq-count, a function from the tool HTSeq
(http://wwwhuber.embl.de/users/anders/HTSeq/doc/overview.html).

For the purpose of today's workshop, we will be analysing the gene level counts.

## Reading data into R

First, let's load all the libraries we will need today.

```{r, message = FALSE}
library(limma)
library(edgeR)
library(gplots)
library(RColorBrewer)
library(org.Dm.eg.db)
```

Next, read in the data and targets file:

```{r}
counts <- read.delim(file="data/counts_Drosophila.txt")
targets <- read.delim(file="data/SampleInfo_Drosophila.txt")
```

Check that the data has read in correctly

```{r}
head(counts)
targets
```

## Filtering out lowly expressed genes

Our main interest is in testing the treated versus untreated groups. To check how many samples we have in each group we can use the `table` command.

```{r}
table(targets$Group)
```

The minimum sample size is 3. Let's check the relationship between CPM and counts to see what CPM threshold we should be imposing. Recall we're looking for a CPM that corresponds to a count of roughly 10-15.

```{r}
mycpm <- cpm(counts)
plot(counts[,1],mycpm[,1],xlim=c(0,20),ylim=c(0,5))
abline(v=10,col=2)
abline(h=2,col=4)
```

We can filter on a CPM of 2 or 3 in at least 3 samples.

```{r}
thresh <- mycpm > 2
keep <- rowSums(thresh) >= 3
table(keep)

counts.keep <- counts[keep,]
dim(counts.keep)
```

We are filtering out about half our genes.

## Convert to DGEList object

```{r}
y <- DGEList(counts.keep)
```

## Quality control

Let's do a number of quality control plots.

First, check the library sizes:

```{r}
barplot(y$samples$lib.size)
```

Next check the distribution of the counts using a boxplot:

```{r}
par(mfrow=c(1,1))
# Get log2 counts per million
logcpm <- cpm(y$counts,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcpm, xlab="", ylab="Log2 counts per million",las=2,outline=FALSE)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcpm),col="blue")
title("Boxplots of logCPMs (unnormalised)")
```

We can colour by our groups, or by the different library prep.

```{r}
par(mfrow=c(1,2),oma=c(2,0,0,0))
group.col <- c("red","blue")[targets$Group]
boxplot(logcpm, xlab="", ylab="Log2 counts per million",las=2,col=group.col,
        pars=list(cex.lab=0.8,cex.axis=0.8))
abline(h=median(logcpm),col="blue")
title("Boxplots of logCPMs\n(coloured by groups)",cex.main=0.8)

lib.col <- c("light pink","light green")[targets$Library]
boxplot(logcpm, xlab="", ylab="Log2 counts per million",las=2, col=lib.col,
        pars=list(cex.lab=0.8,cex.axis=0.8))
abline(h=median(logcpm),col="blue")
title("Boxplots of logCPMs\n(coloured by library prep)",cex.main=0.8)
```

It doesn't look like there are any large biases in the data.

Finally, let's check our MDS plots.

```{r}
par(mfrow=c(1,1))
plotMDS(y)
```

```{r}
par(mfrow=c(1,2))
plotMDS(y,col=group.col)
legend("topright",legend=levels(targets$Group),fill=c("red","blue"))
plotMDS(y,col=lib.col)
legend("topleft",legend=levels(targets$Library),fill=c("light pink","light green"))
```

It looks like there is some variability in the data due to library type e.g. single end or paired end.

## Hierarchical clustering with heatmap.2

First we need a matrix of log counts:

```{r}
logcounts <- cpm(y,log=TRUE)
```

Get variances for genes:

```{r}
var_genes <- apply(logcounts, 1, var)
```

Get top 500 most variable

```{r}
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
```

```{r}
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)
```

```{r}
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Plot the heatmap
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",ColSideColors=group.col,scale="row",margins=c(10,5))

heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",ColSideColors=lib.col,scale="row",margins=c(10,5))
```

## Normalisation

Let's do TMM normalisation

```{r}
y <- calcNormFactors(y)
y$samples
```

```{r}
par(mfrow=c(1,2))
plotMD(logcounts,column=2)
abline(h=0,col="grey")
plotMD(y,column = 2)
abline(h=0,col="grey")
```

## Differential expression

### Set up design matrix

We want to test for differences between the treated and untreated samples. However, we know that the library preparation adds variability to the data, so we need to account for it in our model. We do this by modelling both Group and Library as variables in our design matrix. This is known as an additive model.

```{r}
design <- model.matrix(~targets$Library + targets$Group)
design
colnames(design) <- c("Int","SEvsPE","UVsT")
```

### Voom transform the data

```{r}
par(mfrow=c(1,1))
v <- voom(y,design,plot=TRUE)
```

```{r}
par(mfrow=c(1,2))
boxplot(logcounts)
abline(h=median(logcounts),col=4)
boxplot(v$E)
abline(h=median(v$E),col=4)
```

### Test for differential expression

```{r}
fit <- lmFit(v,design)
fit <- eBayes(fit)
results <- decideTests(fit)
summary(results)
```

```{r}
topTable(fit,coef=3,sort.by="p")
```

### Add annotation from org.Dm.eg.db

The rownames of the fit object are [http://flybase.org/](FlyBase) ids.

First we need to decide what information we want. In order to see what we can extract we can run the `columns` function on the annotation database.

```{r}
columns(org.Dm.eg.db)
```

We definitely want gene symbols and perhaps the full gene name and Entrez id. Let's build up our annotation information in a separate data frame using the `select` function. Note, by default, the `select` function assumes that the keys provided are Entrez ids. However, in this case we are using FlyBase ids as the keys. As such, we need to give the `select` function this information using the `keytype` argument which we will set to `"FLYBASE"`.

```{r}
ann <- select(org.Dm.eg.db,keys=rownames(fit),columns=c("FLYBASE","ENTREZID","SYMBOL","GENENAME"),keytype="FLYBASE")
# Have a look at the annotation
head(ann)
```

Let's double check that the `FLYBASE` column matches exactly to our `fit` rownames.

```{r}
table(ann$FLYBASE==rownames(fit))
```

We can slot in the annotation information into the `genes` slot of `fit`. (Please note that if the `select` function returns a 1:many mapping then you can't just append the annotation to the fit object. An alternative way to get annotation is shown below.)

```{r}
fit$genes <- ann
```

Now when we run the `topTable` command, the annotation information should be included in the output.

```{r}
topTable(fit,coef=3,sort.by="p")
```

If for some reason the `select` function doesn't work, or the result is a 1:many mapping, there is a less elegant way to get the annotations using the `toTable` command.

```{r}
# Let's see what we can get in table format from org.Dm.eg.db
ls("package:org.Dm.eg.db")
# Get annotation
fly <- toTable(org.Dm.egFLYBASE)
head(fly)
symbol <- toTable(org.Dm.egSYMBOL)
genename <- toTable(org.Dm.egGENENAME)

# We can use the merge command to merge two dataframes
ann1 <- merge(fly,symbol,by="gene_id")
head(ann1)
# Add genename table to ann1
ann2 <- merge(ann1,genename,by="gene_id")
head(ann2)
```

Now we need to match up between `rownames(fit)` and the ensemble gene id in `ann2`.

```{r}
m <- match(rownames(fit),ann2$flybase_id)
table(is.na(m)) # check for unmatched rows
ann3 <- ann2[m[!is.na(m)],] # exclude unmatched rows

# compare the results in ann3 to what is in the fit object
head(ann3)
head(fit$genes)
```

```{r}
topTable(fit,coef=3,sort.by="p")
```

Let's check the expression of pasilla.

```{r}
ps <- grep("pasilla",fit$genes$GENENAME)
topTable(fit[ps,],coef=3)
```

## Plots after testing for DE

```{r}
par(mfrow=c(1,2))
plotMD(fit,coef=3,status=results[,"UVsT"])

volcanoplot(fit,coef=3,highlight=100,names=fit$genes$SYMBOL)
```

```{r}
stripchart(v$E["FBgn0025111",]~targets$Group)
```

```{r}
# Check expression of Pasilla
stripchart(v$E["FBgn0261552",]~targets$Group)
```


## Testing relative to a threshold

```{r}
fit.treat <- treat(fit,lfc=1)
res.treat <- decideTests(fit.treat)
summary(res.treat)
topTreat(fit.treat,coef=3)
```

```{r}
plotMD(fit.treat,coef=3,status=res.treat[,"UVsT"])
abline(h=0,col="grey")
```

## Gene set testing with goana

```{r}
go <- goana(fit, coef="UVsT",species = "Dm", geneid="ENTREZID")
topGO(go, n=10)
```