# The glory of dplyr!
The fancy printing of `help` files, etc., in this document is handled by `printr`:
```{r}
library(printr)
```
Today's exercise requires some imagination on your part.
We will not give you a massive data set.
Rather, we provide you the foundation to efficiently handle some very large computations.
Some of you won't "get it".
Not now at least.
But, [a recent paper](https://www.genetics.org/content/213/4/1513) of mine literally would not have been possible without the methods outlined here.
## The problem
1. You have a very large amount of data.
It cannot fit into RAM, so you cannot use regular `R/dplyr` or `pandas` methods to process it.
2. There's no pre-existing tool for your analysis.
For example, this isn't something that `samtools` or `bedtools` or some other domain-specific software already do.
## Thinking about a solution
Let's reason through a few things:
1. The data can't fit in RAM, so they need to go in a file.
2. So, we need a file format.
3. We also need code to process it.
## Rectangular data and split/apply/combine
Let's fire up the venerable `mtcars` data set and take a look:
```{r}
data(mtcars)
help(mtcars)
```
The first several rows are:
```{r}
head(mtcars)
```
The data are "rectangular", meaning that they are set up like a spreadsheet.
A typical analysis of such data involves:
1. **Split** the data up into groups.
For example, break the data set up according to the number of cylinders or the number of gears.
One could also split by unique `(cylinder, gear)` combos.
2. **Apply** a function to each group.
For example, get the average `mpg` for each group.
3. **Combine** ("aggregate") the results back into a new data frame.
In `base` `R`, we can do these analyses using `aggregate`, which uses `R`'s formula notation, `response ~ factors`.
This kind of formula notation is common in statistical languages, and you'll see it in many corners of `R`.
Let's get the mean `mpg` after grouping by `cyl`:
```{r}
aggregate(mpg ~ cyl, data=mtcars, mean)
```
Now, get the means for all `(cyl, gear)` combos:
```{r}
aggregate(mpg ~ cyl + gear, data=mtcars, mean)
```
Okay, cool.
The code isn't too bad, and it uses a common syntax that is all over the `R` world (the "formula").
Let's look at the `dplyr` version now.
The first example is:
```{r}
library(dplyr)
results = mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
results
```
The second analysis becomes:
```{r}
results = mtcars %>%
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg))
as.data.frame(results)
```
To summarise:
1. We have a new syntax.
Boo!
2. This syntax is "weird", maybe?
3. It is a bit longer/more verbose.
So, why bother?
1. As your analysis becomes more complex, the `dplyr` variant will tend to be *less* verbose than the `base` version.
2. The `dplyr` solution will be *vastly* faster.
3. Finally, `dplyr` has magic super powers that help solve the problem of data sets too big to fit in memory.
This is the part that requires imagination, as our examples so far are in memory.
Trust me, though, the payoff is coming...
### Some technical bits
1. The `%>%` operator (the "pipe") chains together a series of function calls.
To read it, go left to right, starting after the `=`:
* Take the `mtcars` data.
* Pass it to `group_by`, which is our **split** step.
* Pass each group to `summarise`, which is our **apply** step.
* Assign the result to `results`, which is our **combine** step.
2. The `as.data.frame` business in the last example is only needed so that `printr` gives this page a nice rendering.
Technically, these operations return a `tibble`, which is a fancified version of a `data.frame`.
You can read more about `tibble`s in the [R for Data Science](https://r4ds.had.co.nz/) book.
Most of the time, though, you can take them for granted and pretend you have a vanilla `data.frame`.
(That is intentional!
A `tibble` is designed to be as close to a drop-in replacement as possible.)
## A solution for very large data sets
So, if our data can be arranged as one or more "tidy" spreadsheets and our analysis will be "split/apply/combine", then the solution to our problem is:
1. Store each data frame/"spreadsheet" as a *table* in a *relational database* software.
2. Use [dplyr](https://dplyr.tidyverse.org/) for our analysis.
### Relational databases
To oversimplify a bit, these are spreadsheets/data frames stored on disk.
Key open-source implementations include:
* MySQL
* PostgreSQL
* sqlite3
All of these have `SQL` in the name, which stands for "Structured Query Language".
This is an "English-adjacent" language for performing split/apply/combine analyses on databases.
Each program uses a slightly different `SQL` "dialect", which is quite irritating.
The first two (MySQL and PostgreSQL) require administrative privileges to set up.
The last, sqlite3, does not, making it amenable to use in your pipelines.
You give up some speed in exchange, though.
### Databases plus dplyr
This is the payoff: `dplyr` can analyze data *in a relational database* using the **same code** that you use for in-memory *data frames*!!!
This capability of `dplyr` is due to some very impressive software engineering in its [dbplyr](https://dbplyr.tidyverse.org/) backend!
It will write the `SQL` syntax for you.
It can do so for several `SQL` dialects.
The only substantive difference is that your input to a `dplyr` analysis will be a connection to a database table rather than a data frame that's already in memory.
#### Caveat
You can only perform analyses if the `dplyr` "verbs" you use have a direct analog in the `SQL` dialect of your database.
This limitation will not affect you now, but may in the future.
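To make the caveat concrete, here is a hypothetical sketch (not evaluated; `my_rescale` is a made-up plain `R` function, and `con` is a database connection like the one created below):
```{r, eval=F}
# dbplyr leaves functions it does not recognize untranslated, so
# sqlite3 fails with an error like "no such function: my_rescale"
# when the query actually executes.
my_rescale <- function(x) (x - mean(x)) / sd(x)
tbl(con, "mtcars") %>%
  mutate(z = my_rescale(mpg)) %>%
  collect()
```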
#### Documentation
* [dbplyr](https://dbplyr.tidyverse.org/)
* [DBI](https://dbi.r-dbi.org/)
### A working example
```{r, echo=F}
# Clean up any database file left over from a previous run:
if (file.exists("mtcars.sqlite3"))
{
  file.remove("mtcars.sqlite3")
}
```
First, get our data into an `sqlite3` database:
```{r}
library(dbplyr)
# Create a connection ("con") to a database file:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
# Write our data frame to the database in a table called "mtcars"
DBI::dbWriteTable(con, "mtcars", mtcars)
# Disconnect from our database
DBI::dbDisconnect(con)
```
#### Warning
The `dbplyr` docs use the following code, which **did not** work for me:
```{r, eval=F}
copy_to(con, mtcars)
```
Let's check that we have a database file whose size is $> 0$:
```{sh}
ls -lhrt *.sqlite3
```
#### Analyze our database using dplyr
```{r}
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
mtcars2 <- tbl(con, "mtcars")
g = mtcars2 %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```
The object `g` is **not** (yet) a `dplyr` result!
It is an `SQL` query:
```{r}
g %>% show_query()
```
We need to `collect` it to execute the query:
```{r}
result = g %>% collect()
as.data.frame(result)
```
### This looks the same--what is the point?
Well, the point is that it (mostly) looks the same!
However, there is a *big* difference!
The *query* is executed by the `sqlite3` database software.
Because the work happens inside the database, the entire data set is **not** loaded into memory.
In other words:
* When your data fit into memory, use `dplyr`.
* When your data do not fit into memory, use `dplyr` (to analyze stuff in a database).
**One toolkit can handle all of your needs.**
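To push the point a bit further, here is a minimal sketch (same connection and table as above) in which filtering, grouping, and summarising all run inside `sqlite3`; only the small summary table ever crosses into `R`:
```{r}
tbl(con, "mtcars") %>%
  filter(hp > 100) %>%                        # becomes a SQL WHERE clause
  group_by(gear) %>%                          # becomes a SQL GROUP BY
  summarise(mean_wt = mean(wt, na.rm = TRUE)) %>%
  collect() %>%                               # only now does data enter R
  as.data.frame()
```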
## Python
```{r, echo=F}
# Clean up any database file left over from a previous run:
if (file.exists("mtcars_from_pandas.sqlite3"))
{
  file.remove("mtcars_from_pandas.sqlite3")
}
```
Python users can play these games, too.
```{r}
library(reticulate)
```
Grab `mtcars` into Python:
```{python}
mtcars = r.mtcars
mtcars.head()
```
[pandas](https://pandas.pydata.org)' `DataFrame` object has the split/apply/combine stuff built in.
Repeat our split/apply/combine analyses:
```{python}
mtcars.groupby(['cyl'])['mpg'].mean()
```
And the other one:
```{python}
mtcars.groupby(['cyl', 'gear'])['mpg'].mean()
```
### Pandas to sqlite
Simple:
```{python}
import sqlite3  # Part of the Python standard library!
con = sqlite3.connect("mtcars_from_pandas.sqlite3")
# Add our data frame to the mtcars table in the database
mtcars.to_sql("mtcars", con)
con.close()
```
Check that it worked:
```{sh}
ls -lhrt *.sqlite3
```
**NOTE:** The two database files differ in size!
This is because `R` and `pandas` make different default decisions about the column data types.
Thus, your database may be bigger if you create it from `pandas` without changing the default, meaning you may be using more disk space than you need!
It is an "exercise for the interested reader" to learn how to change those defaults.
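If you want to see what each tool decided, you can inspect the declared column types from `R` (a small sketch; `PRAGMA table_info` is `sqlite3`-specific):
```{r}
# Open read connections to both database files:
con_r <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
con_py <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_from_pandas.sqlite3")
# List each table's columns and their declared types:
DBI::dbGetQuery(con_r, "PRAGMA table_info(mtcars)")
DBI::dbGetQuery(con_py, "PRAGMA table_info(mtcars)")
DBI::dbDisconnect(con_r)
DBI::dbDisconnect(con_py)
```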
### Read it back in
This is where `R` is ahead of the curve.
Sadly, to get data back into a `pandas.DataFrame`, we must use "raw" `SQL`:
```{python}
import pandas as pd
con = sqlite3.connect("mtcars_from_pandas.sqlite3")
df = pd.read_sql("select * from mtcars", con)
df.head()
```
To do our analyses:
```{python}
df = pd.read_sql("select cyl, avg(mpg) from mtcars group by cyl", con)
df.head()
```
```{python}
df = pd.read_sql("select cyl, gear, avg(mpg) from mtcars group by cyl, gear", con)
df.head()
```
Most of the time, I'd use `dplyr` here instead.
The `SQL` shown here is deceptively simple.
Trust me--it gets complex fast!
### More things to learn
* `dplyr` and `pandas` are both capable of joining multiple data frames together.
`dplyr` can also join `SQL` tables.
* For big data sets, you usually add data to tables in "chunks" as they arrive.
So, you need to distinguish between *creating* a table and *appending* to an existing one; see the sketch after this list.
Often, *append* is the default.
When you make mistakes, you then need to either `rm -f` your database file or learn to `drop` tables.
(That's another "exercise for the reader".)
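Here is a hedged sketch of the create-vs-append distinction (not evaluated; `mtcars_chunks` is a made-up table name, and `overwrite`/`append` are real arguments to `DBI::dbWriteTable`):
```{r, eval=F}
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
# The first chunk of rows *creates* the table, clobbering any old one:
DBI::dbWriteTable(con, "mtcars_chunks", mtcars[1:16, ], overwrite = TRUE)
# Subsequent chunks *append* to the existing table:
DBI::dbWriteTable(con, "mtcars_chunks", mtcars[17:32, ], append = TRUE)
DBI::dbDisconnect(con)
```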
## Your lab exercise for this week
Repeat the above code blocks!
Couldn't be simpler, right!
(Famous last words.)
### Stuff to install
#### R (via RStudio)
At the very minimum, you'll need:
* dplyr
* dbplyr
* RSQLite
#### Python (via conda)
* sqlite3
You should have `pandas` from last week's exercise.