# The glory of dplyr!
The fancy printing of `help` files, etc., in this document is handled by `printr`:
```{r}
library(printr)
```
Today's exercise requires some imagination on your part.
We will not give you a massive data set.
Rather, we provide you the foundation to efficiently handle some very large computations.
Some of you won't "get it".
Not now at least.
But, [a recent paper](https://www.genetics.org/content/213/4/1513) of mine literally would not have been possible without the methods outlined here.
## The problem
1. You have a very large amount of data.
It cannot fit into RAM, so you cannot use regular `R/dplyr` or `pandas` methods to process it.
2. There's no pre-existing tool for your analysis.
For example, this isn't something that `samtools` or `bedtools` or some other domain-specific software already do.
## Thinking about a solution
Let's reason through a few things:
1. The data can't fit in RAM, so they need to go in a file.
2. So, we need a file format.
3. We also need code to process it.
## Rectangular data and split/apply/combine
Let's fire up the venerable `mtcars` data set and take a look:
```{r}
data(mtcars)
help(mtcars)
```
The first several rows are:
```{r}
head(mtcars)
```
The data are "rectangular", meaning that they are set up like a spreadsheet.
A typical analysis of such data involves:
1. **Split** the data up into groups.
For example, break the data set up according to the number of cylinders or the number of gears.
One could also split by unique `(cylinder, gear)` combos.
2. **Apply** a function to each group.
For example, get the average `mpg` for each group.
3. **Combine** ("aggregate") the results back into a new data frame.
In `base` `R`, we can do these analyses using `aggregate`, which uses `R`'s formula notation, `response ~ factors`.
This kind of formula notation is common in statistical languages, and you'll see it in many corners of `R`.
Let's get the mean `mpg` after grouping by `cyl`:
```{r}
aggregate(mpg ~ cyl, data=mtcars, mean)
```
Now, get the means for all `(cyl, gear)` combos:
```{r}
aggregate(mpg ~ cyl + gear, data=mtcars, mean)
```
Okay, cool.
The code isn't too bad, and it uses a common syntax that is all over the `R` world (the "formula").
Let's look at the `dplyr` version now.
The first example is:
```{r}
library(dplyr)
results = mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
results
```
The second analysis becomes:
```{r}
results = mtcars %>%
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg))
as.data.frame(results)
```
To summarise:
1. We have a new syntax.
Boo!
2. This syntax is "weird", maybe?
3. It is a bit longer/more verbose.
So, why bother?
1. As your analysis becomes more complex, the `dplyr` variant will tend to be *less* verbose than the `base` version.
2. The `dplyr` solution will be *vastly* faster.
3. Finally, `dplyr` has magic super powers that help solve the problem of data sets too big to fit in memory.
This is the part that requires imagination, as our examples so far are in memory.
Trust me, though, the payoff is coming...
### Some technical bits
1. The `%>%` operator (the "pipe") chains together a series of function calls.
To read it, go left to right, starting after the `=`:
* Take the `mtcars` data.
* Pass it to `group_by`, which is our **split** step.
* Pass each group to `summarise`, which is our **apply** step.
* Assign the result to `results`, which is our **combine** step.
2. The `as.data.frame` business in the last example is only needed so that `printr` gives this page a nice rendering.
Technically, these operations return a `tibble`, which is a fancified version of a `data.frame`.
You can read more about `tibble`s in the [R for Data Science](https://r4ds.had.co.nz/) book.
Most of the time, though, you can take them for granted and pretend you have a vanilla `data.frame`.
(That is intentional!
A `tibble` is designed to be as close to a drop-in replacement as possible.)
## A solution for very large data sets
So, if our data can be arranged as one or more "tidy" spreadsheets and our analysis will be "split/apply/combine", then the solution to our problem is:
1. Store each data frame/"spreadsheet" as a *table* in a *relational database* software.
2. Use [dplyr](https://dplyr.tidyverse.org/) for our analysis.
### Relational databases
To oversimplify a bit, these are spreadsheets/data frames stored on disk.
Key open-source implementations include:
* MySQL
* PostgreSQL
* sqlite3
All of these have `SQL` in the name, which stands for "Structured Query Language".
This is an "English-adjacent" language for performing split/apply/combine analyses on databases.
Each program uses a slightly different `SQL` "dialect", which is quite irritating.
The first two (MySQL and PostgreSQL) require administrative privileges to set up.
The last, sqlite3, does not, making it amenable to use in your pipelines.
You give up some speed in exchange, though.
### Databases plus dplyr
This is the payoff: `dplyr` can analyze data *in a relational database* using the **same code** that you use for in-memory *data frames*!!!
This capability of `dplyr` is due to some very impressive software engineering in its [dbplyr](https://dbplyr.tidyverse.org/) backend!
It will write the `SQL` syntax for you.
It can do so for several `SQL` dialects.
The only substantive difference is that your input to a `dplyr` analysis will be a connection to a database table rather than a data frame that's already in memory.
#### Caveat
You can only perform analyses if the `dplyr` "verbs" you use have a direct analog in the `SQL` dialect of your database.
This limitation will not affect you now, but may in the future.
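To make the caveat concrete, here is a hypothetical sketch (not evaluated; `my_rescale` is a made-up plain `R` function, and `con` is a database connection like the one created below):
```{r, eval=F}
# dbplyr leaves functions it does not recognize untranslated, so
# sqlite3 fails with an error like "no such function: my_rescale"
# when the query actually executes.
my_rescale <- function(x) (x - mean(x)) / sd(x)
tbl(con, "mtcars") %>%
  mutate(z = my_rescale(mpg)) %>%
  collect()
```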
#### Documentation
* [dbplyr](https://dbplyr.tidyverse.org/)
* [DBI](https://dbi.r-dbi.org/)
### A working example
```{r, echo=F}
# Clean up any database file left over from a previous run:
if (file.exists("mtcars.sqlite3"))
{
  file.remove("mtcars.sqlite3")
}
```
First, get our data into an `sqlite3` database:
```{r}
library(dbplyr)
# Create a connection ("con") to a database file:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
# Write our data frame to the database in a table called "mtcars"
DBI::dbWriteTable(con, "mtcars", mtcars)
# Disconnect from our database
DBI::dbDisconnect(con)
```
#### Warning
The `dbplyr` docs use the following code, which **did not** work for me:
```{r, eval=F}
copy_to(con, mtcars)
```
Let's check that we have a database file whose size is $> 0$:
```{sh}
ls -lhrt *.sqlite3
```
#### Analyze our database using dplyr
```{r}
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
mtcars2 <- tbl(con, "mtcars")
g = mtcars2 %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```
The object `g` is **not** (yet) a `dplyr` result!
It is an `SQL` query:
```{r}
g %>% show_query()
```
We need to `collect` it to execute the query:
```{r}
result = g %>% collect()
as.data.frame(result)
```
### This looks the same--what is the point?
Well, the point is that it (mostly) looks the same!
However, there is a *big* difference!
The *query* is executed by the `sqlite3` database software.
Because the work happens inside the database, the entire data set is **not** loaded into memory.
In other words:
* When your data fit into memory, use `dplyr`.
* When your data do not fit into memory, use `dplyr` (to analyze stuff in a database).
**One toolkit can handle all of your needs.**
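To push the point a bit further, here is a minimal sketch (same connection and table as above) in which filtering, grouping, and summarising all run inside `sqlite3`; only the small summary table ever crosses into `R`:
```{r}
tbl(con, "mtcars") %>%
  filter(hp > 100) %>%                        # becomes a SQL WHERE clause
  group_by(gear) %>%                          # becomes a SQL GROUP BY
  summarise(mean_wt = mean(wt, na.rm = TRUE)) %>%
  collect() %>%                               # only now does data enter R
  as.data.frame()
```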
## Python
```{r, echo=F}
# Clean up any database file left over from a previous run:
if (file.exists("mtcars_from_pandas.sqlite3"))
{
  file.remove("mtcars_from_pandas.sqlite3")
}
```
Python users can play these games, too.
```{r}
library(reticulate)
```
Grab `mtcars` into Python:
```{python}
mtcars = r.mtcars
mtcars.head()
```
[pandas](https://pandas.pydata.org)' `DataFrame` object has the split/apply/combine stuff built in.
Repeat our split/apply/combine analyses:
```{python}
mtcars.groupby(['cyl'])['mpg'].mean()
```
And the other one:
```{python}
mtcars.groupby(['cyl', 'gear'])['mpg'].mean()
```
### Pandas to sqlite
Simple:
```{python}
import sqlite3  # Part of the Python standard library!
con = sqlite3.connect("mtcars_from_pandas.sqlite3")
# Add our data frame to the mtcars table in the database
mtcars.to_sql("mtcars", con)
con.close()
```
Check that it worked:
```{sh}
ls -lhrt *.sqlite3
```
**NOTE:** The two database files differ in size!
This is because `R` and `pandas` make different default decisions about the column data types.
Thus, your database may be bigger if you create it from `pandas` without changing the default, meaning you may be using more disk space than you need!
It is an "exercise for the interested reader" to learn how to change those defaults.
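If you want to see what each tool decided, you can inspect the declared column types from `R` (a small sketch; `PRAGMA table_info` is `sqlite3`-specific):
```{r}
# Open read connections to both database files:
con_r <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
con_py <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_from_pandas.sqlite3")
# List each table's columns and their declared types:
DBI::dbGetQuery(con_r, "PRAGMA table_info(mtcars)")
DBI::dbGetQuery(con_py, "PRAGMA table_info(mtcars)")
DBI::dbDisconnect(con_r)
DBI::dbDisconnect(con_py)
```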
### Read it back in
This is where `R` is ahead of the curve.
Sadly, to get data back into a `pandas.DataFrame`, we must use "raw" `SQL`:
```{python}
import pandas as pd
con = sqlite3.connect("mtcars_from_pandas.sqlite3")
df = pd.read_sql("select * from mtcars", con)
df.head()
```
To do our analyses:
```{python}
df = pd.read_sql("select cyl, avg(mpg) from mtcars group by cyl", con)
df.head()
```
```{python}
df = pd.read_sql("select cyl, gear, avg(mpg) from mtcars group by cyl, gear", con)
df.head()
```
Most of the time, I'd use `dplyr` here instead.
The `SQL` shown here is deceptively simple.
Trust me--it gets complex fast!
### More things to learn
* `dplyr` and `pandas` are both capable of joining multiple data frames together.
`dplyr` can also join `SQL` tables.
* For big data sets, you usually add data to tables in "chunks" as they arrive.
So, you need to distinguish between *creating* a table and *appending* to an existing one; see the sketch after this list.
Often, *append* is the default.
When you make mistakes, you then need to either `rm -f` your database file or learn to `drop` tables.
(That's another "exercise for the reader".)
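Here is a hedged sketch of the create-vs-append distinction (not evaluated; `mtcars_chunks` is a made-up table name, and `overwrite`/`append` are real arguments to `DBI::dbWriteTable`):
```{r, eval=F}
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
# The first chunk of rows *creates* the table, clobbering any old one:
DBI::dbWriteTable(con, "mtcars_chunks", mtcars[1:16, ], overwrite = TRUE)
# Subsequent chunks *append* to the existing table:
DBI::dbWriteTable(con, "mtcars_chunks", mtcars[17:32, ], append = TRUE)
DBI::dbDisconnect(con)
```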
## Your lab exercise for this week
Repeat the above code blocks!
Couldn't be simpler, right!
(Famous last words.)
### Stuff to install
#### R (via RStudio)
At the very minimum, you'll need:
* dplyr
* dbplyr
* RSQLite
#### Python (via conda)
* sqlite3
You should have `pandas` from last week's exercise.