-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy path00-r-rstudio-intro.Rmd
508 lines (380 loc) · 17 KB
/
00-r-rstudio-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
---
layout: page
title: RNA-seq analysis in R
subtitle: R for RNAseq
minutes: 40
---
```{r, include=FALSE}
source("tools/chunk-options.R")
```
## Introduction to RStudio
We'll be using RStudio: a free, open source R integrated development
environment. It provides a built in editor, works on all platforms (including
on servers) and provides many advantages such as integration with version
control and project management.
**Basic layout**
When you first open RStudio, you will be greeted by three panels:
* The interactive R console (entire left)
* Environment/History (tabbed in upper right)
* Files/Plots/Packages/Help/Viewer (tabbed in lower right)
Once you open files, such as R scripts, an editor panel will also open
in the top left.
## Work flow within RStudio
There are two main ways one can work within RStudio.
1. Test and play within the interactive R console then copy code into
a .R file to run later.
* This works well when doing small tests and initially starting off.
* It quickly becomes laborious
2. Start writing in an .R file and use RStudio's command / short cut
to push current line, selected lines or modified lines to the
interactive R console.
* This is a great way to start; all your code is saved for later
* You will be able to run the file you create from within RStudio
or using R's `source()` function.
> ## Tip: Running segments of your code {.callout}
>
> RStudio offers you great flexibility in running code from within the editor
> window. There are buttons, menu choices, and keyboard shortcuts. To run the
> current line, you can 1. click on the `Run` button just above the editor panel,
> or 2. select "Run Lines" from the "Code" menu, or 3. hit Ctrl-Enter in Windows
> or Linux or Command-Enter on OS X. (This shortcut can also be seen by hovering
> the mouse over the button). To run a block of code, select it and then `Run`.
> If you have modified a line of code within a block of code you have just run,
> there is no need to reselect the section and `Run`, you can use the next button
> along, `Re-run the previous region`. This will run the previous code block
> including the modifications you have made.
## Your RStudio environment
Much of your time in R will be spent in the R interactive
console. This is where you will run all of your code, and can be a
useful environment to try out ideas before adding them to an R script
file. This console in RStudio is the same as the one you would get if
you just typed in `R` in your commandline environment.
## R Packages
It is possible to add functions to R by writing a package, or by
obtaining a package written by someone else. As of this writing, there
are over 7,000 packages available on CRAN (the comprehensive R archive
network). R and RStudio have functionality for managing packages:
* You can see what packages are installed by typing
`installed.packages()`
* You can install packages by typing `install.packages("packagename")`,
where `packagename` is the package name, in quotes.
* You can update installed packages by typing `update.packages()`
* You can remove a package with `remove.packages("packagename")`
* You can make a package available for use with `library(packagename)`
For this workshop we will also be using packages from
These can all be obtained from Bioconductor.
Open RStudio and run the following commands to install packages from [Bioconductor](https://www.bioconductor.org/). These are installed slightly differently. For example, to install the package `limma`:
```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
```
## RStudio project management
The scientific process is naturally incremental, and many projects
start life as random notes, some code, then a manuscript, and
eventually everything is a bit mixed together.
<blockquote class="twitter-tweet"><p>Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.</p>— Vince Buffalo <a href="https://twitter.com/vsbuffalo/status/323638476153167872">April 15, 2013</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
Most people tend to organize their projects like this:
![](fig/bad_layout.png)
There are many reasons why we should *ALWAYS* avoid this:
1. It is really hard to tell which version of your data is
the original and which is the modified;
2. It gets really messy because it mixes files with various
extensions together;
3. It probably takes you a lot of time to actually find
things, and relate the correct figures to the exact code
that has been used to generate it;
A good project layout will ultimately make your life easier:
* It will help ensure the integrity of your data;
* It makes it simpler to share your code with someone else
(a lab-mate, collaborator, or supervisor);
* It allows you to easily upload your code with your manuscript submission;
* It makes it easier to pick the project back up after a break.
## A possible solution
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management
functionality. We'll be using this today to create a self-contained, reproducible
project.
> ## Challenge: Creating a self-contained project {.challenge}
>
> We're going to create a new project in RStudio:
>
> 1. Click the "File" menu button, then "New Project".
> 2. Click "New Directory".
> 3. Click "Empty Project".
> 4. Type in the name of the directory to store your project, e.g. "my_project", or better yet use the name of this workshop.
> 5. Click the "Create Project" button.
>
Now when we start R in this project directory, or open this project with RStudio,
all of our work on this project will be entirely self-contained in this directory.
## Best practices for project organisation
Although there is no "best" way to lay out a project, there are some general
principles to adhere to that will make project management easier:
## Treat data as read only
This is probably the most important goal of setting up a project. Data is
typically time consuming and/or expensive to collect. Working with them
interactively (e.g., in Excel) where they can be modified means you are never
sure of where the data came from, or how it has been modified since collection.
It is therefore a good idea to treat your data as "read-only".
## Data Cleaning
In many cases your data will be "dirty": it will need significant preprocessing
to get into a format R (or any other programming language) will find useful. This
task is sometimes called "data munging". I find it useful to store these scripts
in a separate folder, and create a second "read-only" data folder to hold the
"cleaned" data sets.
## Treat generated output as disposable
Anything generated by your scripts should be treated as disposable: it should
all be able to be regenerated from your scripts.
There are lots of different was to manage this output. I find it useful to
have an output folder with different sub-directories for each separate
analysis. This makes it easier later, as many of my analyses are exploratory
and don't end up being used in the final project, and some of the analyses
get shared between projects.
## Separate function definition and application
The most effective way I find to work in R, is to play around in the interactive
session, then copy commands across to a script file when I'm sure they work and
do what I want. You can also save all the commands you've entered using the
`history` command, but I don't find it useful because when I'm typing its 90%
trial and error.
When your project is new and shiny, the script file usually contains many lines
of directly executed code. As it matures, reusable chunks get pulled into their
own functions. It's a good idea to separate these into separate folders; one
to store useful functions that you'll reuse across analyses and projects, and
one to store the analysis scripts.
> ## Tip: avoiding duplication {.callout}
>
> You may find yourself using data or analysis scripts across several projects.
> Typically you want to avoid duplication to save space and avoid having to
> make updates to code in multiple places.
>
> In this case I find it useful to make "symbolic links", which are essentially
> shortcuts to files somewhere else on a filesystem. On Linux and OS X you can
> use the `ln -s` command, and on windows you can either create a shortcut or
> use the `mklink` command from the windows terminal.
>
## Save the data in the data directory
Now we have a good directory structure we will now place/save the data file in the `data/` directory.
Download the RNAseq data for this workshop.
- Mouse mammary data (counts): [https://figshare.com/s/1d788fd384d33e913a2a](https://figshare.com/s/1d788fd384d33e913a2a)
- Drosophila data (counts): [https://figshare.com/s/e08e71c42f118dbe8be6](https://figshare.com/s/e08e71c42f118dbe8be6)
> ## Challenge {.challenge}
>
> 1. Create a `/data` directory. In the bottom right panel select the "Files" tab,
then "New Folder", then type "data" and click "OK".
> 2. Download the RNAseq data using the links above (if you find the internet is slow, you can just download Day 1 for now).
> 3. Click "Download all" (this will download a zip file).
> 4. Unzip the file (usually double clicking on it will do the trick).
> 5. Move all the files inside into the `data/` folder within your project.
>
## Reading data into R
Our data is tab delimited, which is what read.table expect by default so we don't need to specify `delim`.
However, we do need `header = TRUE`.
```{r}
# Read the data into R
small_counts <- read.table("data/small_counts.txt", header = TRUE)
```
```{r}
print(small_counts)
dim(small_counts)
```
## Working with Data Frames
### Subsetting
Let's say we want the all the counts for Sample_1. There are three main ways we could do this:
`$` notation with the column name
```{r}
small_counts$Sample_1
```
`[row, column]` notation with numeric indices
```{r}
small_counts[, 1]
```
`[row, column]` notation using the column name (in a vector)
```{r}
small_counts[, c("Sample_1")]
```
This is handy when you want to grab multiple columns
```{r}
small_counts[, c("Sample_1", "Sample_3")]
```
You can even mix numerical and named indices
```{r}
small_counts[1:3, c("Sample_1", "Sample_3")]
```
### Vectorisation
R arithmetic operators are vectorised by default.
```{r}
small_counts$Sample_1 * 2
```
So are many R functions.
```{r}
log(small_counts)
```
Let's say we want to calculate the sum of counts for each sample.
We can do this for each sample by subsetting and then using the `sum` function.
```{r}
sum(small_counts$Sample_1)
sum(small_counts$Sample_2)
```
If you had a lot of samples, you can imagine this would get tedious.
So instead, we can use an `apply` statement to apply the `sum` function to every column at once.
In R, this is usually quicker than using a loop.
Note the `MARGIN` argument tells `apply` which direction you want to go in.
MARGIN = 1 will apply `sum` to each row, while MARGIN = 2 will apply `sum` to each column.
```{r}
# Sum the counts for each sample
sample_sums = apply(small_counts, MARGIN = 2, sum)
print(sample_sums)
```
## Data types
There are 5 main types: doubles, integers, complex, logical and character.
```{r}
typeof(3.14)
typeof(1L)
typeof(1+1i)
typeof(TRUE)
typeof('banana')
```
Note the `L` suffix to insist that a number is an integer. No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.
Let's read in some data that contains things of more than one type.
This is actually a sample of a differential expression results table that you will generate later in the workshop.
```{r}
ResultsTable_small <- read.table("data/ResultsTable_small.txt", header=TRUE)
print(ResultsTable_small)
```
We can use `str` to reveal the structure of our `ResultsTable_small`.
```{r}
str(ResultsTable_small)
```
We can see that it's a data frame, and `str` also tells us the type of each column.
Everything in a single column must be of the same type, while each column can be a different type.
This is because data frames are actually lists of vectors, and vectors must contain data of only one type.
When you try to mix types in a vector, strange things can happen.
```{r}
my_vector = c(1, "hello", TRUE)
print(my_vector)
```
Notice that everything has been converted to a character. You can check this with `typeof`.
```{r}
typeof(my_vector)
```
This is called type coercion.
When you try to create a vector containing a mixture of types R will automatically (and silently) convert everything to the same type.
Generally it will chose a type that means you don't lose any information.
The coercion rules go: `logical` -> `integer` -> `numeric` -> `complex` -> `character`.
Note, the same thing will happen if you try to read in data from a file that has a mixture of types within a single column.
### Factors
You may have noticed that `ResultsTable_small` contained a column annotated as `Factor`.
```{r}
str(ResultsTable_small)
```
Factors usually look like character data, but are typically used to represent categorical information.
```{r}
str(ResultsTable_small$SYMBOL)
```
Instead of printing out the strings we gave it, we got a bunch of numbers instead.
R has replaced our human-readable categories with numbered indices under the hood:
```{r}
typeof(ResultsTable_small$SYMBOL)
```
In modelling functions, it's important to know what the baseline levels are. This is assumed to be the
first factor, but by default factors are labelled in alphabetical order. You can change this by specifying the levels:
```{r}
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
```
In this case, we've explicitly told R that "control" should represented by 1, and
"case" by 2. This designation can be very important for interpreting the
results of statistical models!
If you prefer not to work with factors, you can simply tell the `read.table` function to keep all strings as characters.
```{r}
ResultsTable_small <- read.table("data/ResultsTable_small.txt", stringsAsFactors = FALSE, header=TRUE)
```
### Sorting
If you just want to sort a vector from smallest to largest, you can use the `sort` function:
```{r}
sort(ResultsTable_small$logFC)
```
To reverse the sort order so it goes from largest to smallest:
```{r}
sort(ResultsTable_small$logFC, decreasing = TRUE)
```
And this works with characters:
```{r}
sort(ResultsTable_small$SYMBOL)
```
If you want to sort the entire data frame by the values in a single column, you can use the `order` function.
This gives you the indexes for if the data was sorted. For example:
```{r}
order(ResultsTable_small$logFC)
```
You then use these indexes to sort the data:
```{r}
ResultsTable_small$logFC[order(ResultsTable_small$logFC)]
```
You can then use these indices to sort the rows:
```{r}
ResultsTable_small[order(ResultsTable_small$logFC), ]
```
### Subsetting using logical statements
Let's say we only want to look at the rows where the log fold change is greater than 3.
We can ask if each `logFC` is greater than 3.
```{r}
ResultsTable_small$logFC > 3
```
This gives us a vector of TRUE/FALSE for each value stating if it is greater than 3.
We can then use this to subset the data.
```{r}
ResultsTable_small$logFC[ResultsTable_small$logFC > 3]
```
And to subset the rows of the data frame.
```{r}
ResultsTable_small[ResultsTable_small$logFC > 3, ]
```
In fact, we probably care about things that are -3 as well, so let's include those too:
```{r}
ResultsTable_small[ResultsTable_small$logFC > 3 | ResultsTable_small$logFC < -3, ]
```
Or we can simplify this using the `abs` function to find the absolute value of each `logFC` before testing them.
```{r}
ResultsTable_small[abs(ResultsTable_small$logFC) > 3, ]
```
> ## Tip: Logical Operators in R {.callout}
>
> |Operator | Description|
> | - | - |
> |< | less than|
> |<= | less than or equal to|
> |> | greater than
> |>= | greater than or equal to|
> |== | exactly equal to|
> |!= | not equal to|
> |!x | Not x|
> |x \| y | x OR y|
> |x & y | x AND y|
> |isTRUE(x) | test if X is TRUE|
> Source: [Quick-R](http://www.statmethods.net/management/operators.html)
>
We might also want to reduce our results table down to a set of genes that we are interested in.
We can use similar logical subsetting logic to do this.
```{r}
my_genes <- c("Smad7", "Wif1", "Fam102b", "Tppp3")
```
For each gene in ResultsTable_small$SYMBOL, does it appear in my_genes?
```{r}
ResultsTable_small$SYMBOL %in% my_genes
```
```{r}
ResultsTable_small[ResultsTable_small$SYMBOL %in% my_genes, ]
```
You can use `match` to get the same subset, but this time make sure they are in the same order as in `my_genes`.
This can be useful if you are trying to merge together data from two sources.
```{r}
match(my_genes, ResultsTable_small$SYMBOL)
```
This returns the index for where each of the genes in `my_list appears` in `ResultsTable_small$SYMBOL`.
We can then use this to subset the columns from `ResultsTable_small`.
```{r}
ResultsTable_small[match(my_genes, ResultsTable_small$SYMBOL), ]
```
You can see they are in the same order as `my_genes`!