-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy path07-rnaseq-day2.html
592 lines (584 loc) · 43 KB
/
07-rnaseq-day2.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>RNA-seq analysis in R</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">RNA-seq analysis in R</h1></a>
<h2 class="subtitle">Alignment and feature counting</h2>
<p><strong>Authors: Belinda Phipson, Maria Doyle, Harriet Dashnow</strong></p>
<p>This material has been created using the following sources:<br />
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf <span class="citation">(Lun, Chen, and Smyth 2016)</span> http://bioinf.wehi.edu.au/RNAseqCaseStudy/</p>
<p>Packages used:<br />
Rsubread</p>
<p>Data files needed:<br />
Mouse chromosome 1 Rsubread index files (~400MB).<br />
Targets2.txt<br />
The 12 fastq.gz files for the mouse dataset.</p>
<p>Mouse mammary data (fastq files): <a href="https://figshare.com/s/f5d63d8c265a05618137" class="uri">https://figshare.com/s/f5d63d8c265a05618137</a> You should download these files and place them in your <code>/data</code> directory.</p>
<p>GEO entry for the dataset:<br />
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450</p>
<p>The raw reads were downloaded from SRA from the link given in GEO for the dataset (ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP045%2FSRP045534). These files are in .sra format. The sra toolkit from NCBI was used to convert the .sra files to .fastq files using the fastq-dump command.</p>
<h2 id="downloading-genome-files">Downloading genome files</h2>
<p>We have provided the index files for chromosome 1 for the mouse genome build mm10 for this workshop in order to save time on building the index. However, full genome fasta files for a number of different genomes are available to download from the UCSC genome browser, see http://hgdownload.soe.ucsc.edu/downloads.html; from NCBI: http://www.ncbi.nlm.nih.gov/genome; or from ENSEMBL: http://asia.ensembl.org/info/data/ftp/index.html.</p>
<h2 id="introduction-and-data-import">Introduction and data import</h2>
<p>For the purpose of this workshop, we are going to be working with a small part of the mouse reference genome (chromosome 1) to demonstrate how to do read alignment and counting using R. Mapping reads to the genome is a very important task, and many different aligners are available, such as bowtie <span class="citation">(Langmead and Salzberg 2012)</span>, topHat <span class="citation">(Trapnell, Pachter, and Salzberg 2009)</span>, STAR <span class="citation">(Dobin et al. 2013)</span> and Rsubread <span class="citation">(Liao, Smyth, and Shi 2013)</span>. Rsubread is the only aligner that can run in R. Most alignment tools are run in a linux environment, and they are very computationally intensive. Most mapping tasks require larger computers than an average laptop, so usually read mapping is done on a server in a linux-like environment. Here we are only going to be mapping 1000 reads from each sample from our mouse lactation dataset <span class="citation">(Fu et al. 2015)</span>, and we will only be mapping to chromosome 1. This is so that everyone can have a go at alignment and counting on their laptops using RStudio.</p>
<p><em>Don’t forget to open a new project specifying the directory where you have saved all the data files for day 2</em></p>
<p>First, let’s load the Rsubread package into R.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(Rsubread)</code></pre></div>
<p>Earlier we put all the sequencing read data (.fastq.gz files) in the data directory. Now we need to find them in order to tell the Rsubread aligner which files to look at. We can search for all .fastq.gz files in the data directory using the <code>list.files</code> command. The pattern argument takes a regular expression. In this case we are using the <code>$</code> to mean the end of the string, so we will only get files that end in “.fastq.gz”</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fastq.files <-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">path =</span> <span class="st">"./data"</span>, <span class="dt">pattern =</span> <span class="st">".fastq.gz$"</span>, <span class="dt">full.names =</span> <span class="ot">TRUE</span>)
fastq.files</code></pre></div>
<pre class="output"><code> [1] "./data/SRR1552444.fastq.gz" "./data/SRR1552445.fastq.gz"
[3] "./data/SRR1552446.fastq.gz" "./data/SRR1552447.fastq.gz"
[5] "./data/SRR1552448.fastq.gz" "./data/SRR1552449.fastq.gz"
[7] "./data/SRR1552450.fastq.gz" "./data/SRR1552451.fastq.gz"
[9] "./data/SRR1552452.fastq.gz" "./data/SRR1552453.fastq.gz"
[11] "./data/SRR1552454.fastq.gz" "./data/SRR1552455.fastq.gz"
</code></pre>
<h2 id="alignment">Alignment</h2>
<h3 id="build-the-index">Build the index</h3>
<p>Read sequences are stored in compressed (gzipped) FASTQ files. Before the differential expression analysis can proceed, these reads must be aligned to the mouse genome and counted into annotated genes. This can be achieved with functions in the Rsubread package.</p>
<p>The first step in performing the alignment is to build an index. In order to build an index you need to have the fasta file (.fa), which can be downloaded from the UCSC genome browser. Here we are building the index just for chromosome 1. This may take several minutes to run. Building the full index using the whole mouse genome usually takes about 30 minutes to an hr on a server. We won’t be building the index in the workshop due to time constraints, we have provided the index files for you. The command below assumes the chr1 genome information for mm10 is stored in the “chr1.fa” file.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># buildindex(basename="chr1_mm10",reference="chr1.fa")</span></code></pre></div>
<p>The above command will generate the index files in the working directory. In this example, the prefix for the index files is chr1_mm10. You can see the additional files generated using the <code>dir</code> command, which will list every file in your current working directory.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dir</span>()</code></pre></div>
<h3 id="aligning-reads-to-chromosome-1-of-reference-genome">Aligning reads to chromosome 1 of reference genome</h3>
<p>Now that we have generated our index, we can align our reads using the <code>align</code> command. There are often numerous mapping parameters that we can specify, but usually the default mapping parameters for the <code>align</code> function are fine. If we had paired end data, we would specify the second read file/s using the <code>readfile2</code> argument. Our mouse data comprises 100bp single end reads.</p>
<p>We can specify the output files, or we can let <code>Rsubread</code> choose the output file names for us. The default output file name is the filename with “.subread.BAM” added at the end.</p>
<p>Now we can align our 12 fastq.gz files using the <code>align</code> command.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">align</span>(<span class="dt">index=</span><span class="st">"data/chr1_mm10"</span>,<span class="dt">readfile1=</span>fastq.files)</code></pre></div>
<p>This will align each of the 12 samples one after the other. As we’re only using a subset of 1000 reads per sample, aligning should just take a minute or so for each sample. To run the full samples from this dataset would take several hours per sample. The BAM files are saved in the working directory.</p>
<p>To see how many parameters you can change try the <code>args</code> function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">args</span>(align)</code></pre></div>
<pre class="output"><code>function (index, readfile1, readfile2 = NULL, type = "rna", input_format = "gzFASTQ",
output_format = "BAM", output_file = paste(as.character(readfile1),
"subread", output_format, sep = "."), nsubreads = 10,
TH1 = 3, TH2 = 1, maxMismatches = 3, nthreads = 1, indels = 5,
complexIndels = FALSE, phredOffset = 33, unique = TRUE, nBestLocations = 1,
minFragLength = 50, maxFragLength = 600, PE_orientation = "fr",
nTrim5 = 0, nTrim3 = 0, readGroupID = NULL, readGroup = NULL,
color2base = FALSE, DP_GapOpenPenalty = -1, DP_GapExtPenalty = 0,
DP_MismatchPenalty = 0, DP_MatchScore = 2, detectSV = FALSE)
NULL
</code></pre>
<p>In this example we have kept many of the default settings, which have been optimised to work well under a variety of situations. The default setting for <code>align</code> is that it only keeps reads that uniquely map to the reference genome. For testing differential expression of genes, this is what we want, as the reads are unambigously assigned to one place in the genome, allowing for easier interpretation of the results. Understanding all the different parameters you can change involves doing a lot of reading about the aligner that you are using, and can take a lot of time to understand! Today we won’t be going into the details of the parameters you can change, but you can get more information from looking at the help:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">?align</code></pre></div>
<p>We can get a summary of the proportion of reads that mapped to the reference genome using the <code>propmapped</code> function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">bam.files <-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">path =</span> <span class="st">"./data"</span>, <span class="dt">pattern =</span> <span class="st">".BAM$"</span>, <span class="dt">full.names =</span> <span class="ot">TRUE</span>)
bam.files</code></pre></div>
<pre class="output"><code> [1] "./data/SRR1552444.fastq.gz.subread.BAM"
[2] "./data/SRR1552445.fastq.gz.subread.BAM"
[3] "./data/SRR1552446.fastq.gz.subread.BAM"
[4] "./data/SRR1552447.fastq.gz.subread.BAM"
[5] "./data/SRR1552448.fastq.gz.subread.BAM"
[6] "./data/SRR1552449.fastq.gz.subread.BAM"
[7] "./data/SRR1552450.fastq.gz.subread.BAM"
[8] "./data/SRR1552451.fastq.gz.subread.BAM"
[9] "./data/SRR1552452.fastq.gz.subread.BAM"
[10] "./data/SRR1552453.fastq.gz.subread.BAM"
[11] "./data/SRR1552454.fastq.gz.subread.BAM"
[12] "./data/SRR1552455.fastq.gz.subread.BAM"
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">props <-<span class="st"> </span><span class="kw">propmapped</span>(<span class="dt">files=</span>bam.files)</code></pre></div>
<pre class="output"><code>The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 133; the mappability is 13.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 130; the mappability is 13.00%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 111; the mappability is 11.10%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 103; the mappability is 10.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 53; the mappability is 5.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 67; the mappability is 6.70%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 119; the mappability is 11.90%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 145; the mappability is 14.50%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 133; the mappability is 13.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 142; the mappability is 14.20%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 127; the mappability is 12.70%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a 'ulimit -n 500' command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 113; the mappability is 11.30%
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">props</code></pre></div>
<pre class="output"><code> Samples NumTotal NumMapped PropMapped
1 ./data/SRR1552444.fastq.gz.subread.BAM 1000 133 0.133
2 ./data/SRR1552445.fastq.gz.subread.BAM 1000 130 0.130
3 ./data/SRR1552446.fastq.gz.subread.BAM 1000 111 0.111
4 ./data/SRR1552447.fastq.gz.subread.BAM 1000 103 0.103
5 ./data/SRR1552448.fastq.gz.subread.BAM 1000 53 0.053
6 ./data/SRR1552449.fastq.gz.subread.BAM 1000 67 0.067
7 ./data/SRR1552450.fastq.gz.subread.BAM 1000 119 0.119
8 ./data/SRR1552451.fastq.gz.subread.BAM 1000 145 0.145
9 ./data/SRR1552452.fastq.gz.subread.BAM 1000 133 0.133
10 ./data/SRR1552453.fastq.gz.subread.BAM 1000 142 0.142
11 ./data/SRR1552454.fastq.gz.subread.BAM 1000 127 0.127
12 ./data/SRR1552455.fastq.gz.subread.BAM 1000 113 0.113
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Try aligning the fastq files allowing multi-mapping reads (set <code>unique = FALSE</code>), and allowing for up to 6 “best” locations to be reported (<code>nBestLocations = 6</code>). Specify the output file names (bam.files.multi) by substituting “.fastq.gz” with “.multi.bam” so we don’t overwrite our unique alignment bam files.</li>
<li>Look at the proportion of reads mapped and see if we get any more reads mapping by specifying a less stringent criteria.</li>
</ol>
</div>
</section>
<h2 id="quality-control">Quality control</h2>
<p>We can have a look at the quality scores associated with each base that has been called by the sequencing machine using the <code>qualityScores</code> function in <em>Rsubread</em>.</p>
<p>Let’s first extract quality scores for 100 reads for the file “SRR1552450.fastq.gz”.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Extract quality scores</span>
qs <-<span class="st"> </span><span class="kw">qualityScores</span>(<span class="dt">filename=</span><span class="st">"data/SRR1552450.fastq.gz"</span>,<span class="dt">nreads=</span><span class="dv">100</span>)</code></pre></div>
<pre class="output"><code>
qualityScores Rsubread 1.22.3
Scan the input file...
Totally 1000 reads were scanned; the sampling interval is 9.
Now extract read quality information...
Completed successfully. Quality scores for 100 reads (equally spaced in the file) are returned.
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Check dimension of qs</span>
<span class="kw">dim</span>(qs)</code></pre></div>
<pre class="output"><code>[1] 100 100
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Check first few elements of qs with head</span>
<span class="kw">head</span>(qs)</code></pre></div>
<pre class="output"><code> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[1,] 31 34 31 37 37 37 37 37 39 37 39 39 39 41 40 41 41 41 36 38 40 41 39
[2,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 41 39 40 41 41 41 41
[3,] 34 34 31 37 37 37 37 37 39 39 39 39 39 41 41 41 41 41 41 41 41 41 41
[4,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 41 39 40 40 41 41 41
[5,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 40 41 41 41 41 41 41
[6,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 41 41 41 41 41 41 41
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[1,] 40 38 40 40 40 40 40 38 40 40 40 38 37 39 40 40 41 40 41 41 38 40 40
[2,] 41 41 41 41 41 41 41 41 41 38 40 41 41 41 41 41 41 41 41 41 41 41 41
[3,] 41 41 41 41 41 41 40 40 41 38 38 38 40 41 40 41 39 40 40 41 41 41 39
[4,] 41 40 41 41 41 41 41 40 41 41 41 41 40 41 41 41 41 41 41 41 41 41 41
[5,] 40 41 41 41 41 41 39 40 41 41 41 41 40 41 41 41 41 41 41 41 41 41 41
[6,] 41 41 36 40 40 41 41 41 41 40 38 40 38 40 41 41 40 41 41 41 40 40 41
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[1,] 40 41 41 41 40 40 36 40 38 37 37 35 35 33 35 35 33 35 34 35 35 35 34
[2,] 41 41 41 41 41 41 41 41 41 41 41 40 41 41 41 39 39 39 39 37 37 37 37
[3,] 40 38 40 40 37 39 40 41 41 40 37 39 39 39 37 37 37 37 37 37 36 36 35
[4,] 41 41 41 41 41 41 41 41 41 40 40 41 40 41 41 41 41 40 41 41 39 39 39
[5,] 41 41 41 41 41 40 41 41 41 41 40 41 41 41 38 40 41 41 41 41 39 39 39
[6,] 40 40 40 41 40 39 33 36 35 34 35 35 35 35 35 35 35 33 33 33 34 31 31
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
[1,] 34 34 35 35 35 35 35 35 35 36 35 35 35 34 34 35 34 35 35 35 35 35 35
[2,] 36 36 36 34 36 35 35 35 35 35 35 35 35 35 35 37 35 36 35 35 35 35 35
[3,] 36 35 33 35 35 35 35 34 36 35 35 34 35 35 35 34 35 35 35 35 35 33 35
[4,] 39 37 37 37 37 37 37 36 36 36 36 36 36 35 35 35 35 35 35 35 35 35 34
[5,] 39 39 38 37 37 37 37 37 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35
[6,] 34 29 34 33 34 35 32 32 34 34 33 31 34 34 18 24 32 33 35 35 27 32 31
93 94 95 96 97 98 99 100
[1,] 35 35 35 35 35 35 35 33
[2,] 35 36 36 35 35 34 34 35
[3,] 35 35 31 34 35 35 33 33
[4,] 35 35 35 35 35 35 35 35
[5,] 35 36 35 35 35 35 35 35
[6,] 32 34 34 35 34 34 33 34
</code></pre>
<p>A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call. (A quality score of 10 is a 1 in 10 chance of an incorrect base call.) To look at the overall distribution of quality scores across the 100 reads, we can look at a boxplot</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">boxplot</span>(qs)</code></pre></div>
<p><img src="fig-07/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-1"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Extract quality scores for SRR1552451.fastq.gz for 50 reads.</li>
<li>Plot a boxplot of the quality scores for SRR1552451.fastq.gz.</li>
</ol>
</div>
</section>
<h2 id="counting">Counting</h2>
<p>Now that we have figured out where each read comes from in the genome, we need to summarise the information across genes or exons. The alignment produces a set of BAM files, where each file contains the read alignments for each library. In the BAM file, there is a chromosomal location for every read that mapped uniquely. The mapped reads can be counted across mouse genes by using the <code>featureCounts</code> function. <code>featureCounts</code> contains built-in annotation for mouse (mm9, mm10) and human (hg19) genome assemblies (NCBI refseq annotation).</p>
<p>The code below uses the exon intervals defined in the NCBI refseq annotation of the mm10 genome. Reads that map to exons of genes are added together to obtain the count for each gene, with some care taken with reads that span exon-exon boundaries. <code>featureCounts</code> takes all the BAM files as input, and outputs an object which includes the count matrix, similar to the count matrix we have been working with on Day 1. Each sample is a separate column, each row is a gene.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fc <-<span class="st"> </span><span class="kw">featureCounts</span>(bam.files, <span class="dt">annot.inbuilt=</span><span class="st">"mm10"</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># See what slots are stored in fc</span>
<span class="kw">names</span>(fc)</code></pre></div>
<pre class="output"><code>[1] "counts" "annotation" "targets" "stat"
</code></pre>
<p>The statistics of the read mapping can be seen with fc$stats. This reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. (We know the real reason why the majority of the reads aren’t mapping - they’re not from chr 1!)</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Take a look at the featurecounts stats
fc$stat</code></pre></div>
<pre class="output"><code> Status ..data.SRR1552444.fastq.gz.subread.BAM
1 Assigned 48
2 Unassigned_Ambiguity 0
3 Unassigned_MultiMapping 0
4 Unassigned_NoFeatures 85
5 Unassigned_Unmapped 867
6 Unassigned_MappingQuality 0
7 Unassigned_FragmentLength 0
8 Unassigned_Chimera 0
9 Unassigned_Secondary 0
10 Unassigned_Nonjunction 0
11 Unassigned_Duplicate 0
..data.SRR1552445.fastq.gz.subread.BAM
1 45
2 1
3 0
4 84
5 870
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552446.fastq.gz.subread.BAM
1 34
2 0
3 0
4 77
5 889
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552447.fastq.gz.subread.BAM
1 34
2 0
3 0
4 69
5 897
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552448.fastq.gz.subread.BAM
1 13
2 1
3 0
4 39
5 947
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552449.fastq.gz.subread.BAM
1 20
2 0
3 0
4 47
5 933
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552450.fastq.gz.subread.BAM
1 50
2 1
3 0
4 68
5 881
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552451.fastq.gz.subread.BAM
1 69
2 0
3 0
4 76
5 855
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552452.fastq.gz.subread.BAM
1 50
2 2
3 0
4 81
5 867
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552453.fastq.gz.subread.BAM
1 52
2 1
3 0
4 89
5 858
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552454.fastq.gz.subread.BAM
1 54
2 0
3 0
4 73
5 873
6 0
7 0
8 0
9 0
10 0
11 0
..data.SRR1552455.fastq.gz.subread.BAM
1 42
2 0
3 0
4 71
5 887
6 0
7 0
8 0
9 0
10 0
11 0
</code></pre>
<p>The counts for the samples are stored in fc$counts. Take a look at that.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Take a look at the dimensions to see the number of genes
<span class="kw">dim</span>(fc$counts)</code></pre></div>
<pre class="output"><code>[1] 27179 12
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Take a look at the first 6 lines
<span class="kw">head</span>(fc$counts)</code></pre></div>
<pre class="output"><code> ..data.SRR1552444.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552445.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552446.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552447.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552448.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552449.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552450.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552451.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552452.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552453.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552454.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
..data.SRR1552455.fastq.gz.subread.BAM
497097 0
100503874 0
100038431 0
19888 0
20671 0
27395 0
</code></pre>
<p>The row names of the fc$counts matrix represent the Entrez gene identifiers for each gene and the column names are the output filenames from calling the <code>align</code> function. The <code>annotation</code> slot shows the annotation information that <code>featureCounts</code> used to summarise reads over genes.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(fc$annotation)</code></pre></div>
<pre class="output"><code> GeneID Chr
1 497097 chr1;chr1;chr1
2 100503874 chr1;chr1
3 100038431 chr1
4 19888 chr1;chr1;chr1;chr1;chr1;chr1
5 20671 chr1;chr1;chr1;chr1;chr1
6 27395 chr1;chr1;chr1;chr1;chr1
Start
1 3214482;3421702;3670552
2 3647309;3658847
3 3680155
4 4290846;4343507;4351910;4352202;4360200;4409170
5 4490928;4493100;4493772;4495136;4496291
6 4773198;4777525;4782568;4783951;4785573
End Strand Length
1 3216968;3421901;3671498 -;-;- 3634
2 3650509;3658904 -;- 3259
3 3681788 + 1634
4 4293012;4350091;4352081;4352837;4360314;4409241 -;-;-;-;-;- 9747
5 4492668;4493466;4493863;4495942;4496413 -;-;-;-;- 3130
6 4776801;4777648;4782733;4784105;4785726 -;-;-;-;- 4203
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-2"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Redo the counting over the exons, rather than the genes (specify <code>useMetaFeatures = FALSE</code>). Use the bam files generated doing alignment reporting only unique reads, and call the <code>featureCounts</code> object <code>fc.exon</code>. Check the dimension of the counts slot to see how much larger it is.</li>
<li>Using your “.multi.bam” files, redo the counting over genes, allowing for multimapping reads (specify <code>countMultiMappingReads = TRUE</code>), calling the object <code>fc.multi</code>. Check the stats.</li>
</ol>
</div>
</section>
<p>Notes</p>
<ul>
<li>If you are sequencing your own data, the sequencing facility will almost always provide fastq files.<br />
</li>
<li>For publicly available sequence data from GEO/SRA, the files are usually in the Sequence Read Archive (SRA) format. Prior to read alignment, these files need to be converted into the FASTQ format using the fastq-dump utility from the SRA Toolkit. See http: //www.ncbi.nlm.nih.gov/books/NBK158900 for how to download and use the SRA Toolkit.<br />
</li>
<li>By default, alignment is performed with <code>unique=TRUE</code>. If a read can be aligned to two or more locations, <em>Rsubread</em> will attempt to select the best location using a number of criteria. Only reads that have a unique best location are reported as being aligned. Keeping this default is recommended, as it avoids spurious signal from non-uniquely mapped reads derived from, e.g., repeat regions.<br />
</li>
<li>The Phred offset determines the encoding for the base-calling quality string in the FASTQ file. For the Illumina 1.8 format onwards, this encoding is set at +33. However, older formats may use a +64 encoding. Users should ensure that the correct encoding is specified during alignment. If unsure, one can examine the first several quality strings in the FASTQ file. A good rule of thumb is to check whether lower-case letters are present (+64 encoding) or absent (+33).<br />
</li>
<li><code>featureCounts</code> requires gene annotation specifying the genomic start and end position of each exon of each gene. <em>Rsubread</em> contains built-in gene annotation for mouse and human. For other species, users will need to read in a data frame in GTF format to define the genes and exons. Users can also specify a custom annotation file in SAF format. See the Rsubread users guide for more information, or try <code>?featureCounts</code>, which has an example of what an SAF file should like like.</li>
</ul>
<h1 id="package-versions-used">Package versions used</h1>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sessionInfo</span>()</code></pre></div>
<pre class="output"><code>R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)
locale:
[1] en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets base
other attached packages:
[1] Rsubread_1.22.3 knitr_1.14
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.4 tools_3.3.1 stringi_1.1.1 methods_3.3.1
[6] stringr_1.1.0 evaluate_0.9
</code></pre>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-Dobin2013">
<p>Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: ultrafast universal RNA-seq aligner.” <em>Bioinformatics (Oxford, England)</em> 29 (1): 15–21. doi:<a href="https://doi.org/10.1093/bioinformatics/bts635">10.1093/bioinformatics/bts635</a>.</p>
</div>
<div id="ref-Fu2015">
<p>Fu, Nai Yang, Anne C Rios, Bhupinder Pal, Rina Soetanto, Aaron T L Lun, Kevin Liu, Tamara Beck, et al. 2015. “EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival.” <em>Nature Cell Biology</em> 17 (4): 365–75. doi:<a href="https://doi.org/10.1038/ncb3117">10.1038/ncb3117</a>.</p>
</div>
<div id="ref-Langmead2012">
<p>Langmead, Ben, and Steven L Salzberg. 2012. “Fast gapped-read alignment with Bowtie 2.” <em>Nature Methods</em> 9 (4). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.: 357–9. doi:<a href="https://doi.org/10.1038/nmeth.1923">10.1038/nmeth.1923</a>.</p>
</div>
<div id="ref-liao2013subread">
<p>Liao, Yang, Gordon K Smyth, and Wei Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” <em>Nucleic Acids Research</em> 41: 16 pages.</p>
</div>
<div id="ref-Lun2016">
<p>Lun, Aaron T L, Yunshun Chen, and Gordon K Smyth. 2016. “It’s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR.” <em>Methods in Molecular Biology (Clifton, N.J.)</em> 1418 (January): 391–416. doi:<a href="https://doi.org/10.1007/978-1-4939-3578-9\_19">10.1007/978-1-4939-3578-9\_19</a>.</p>
</div>
<div id="ref-trapnell2009tophat">
<p>Trapnell, Cole, Lior Pachter, and Steven L Salzberg. 2009. “TopHat: discovering splice junctions with RNA-seq.” <em>Bioinformatics</em> 25 (9): 1105–11. doi:<a href="https://doi.org/doi:10.1093/bioinformatics/btp120">doi:10.1093/bioinformatics/btp120</a>.</p>
</div>
</div>
</div>
</div>
</article>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
</body>
</html>