By the end of this short session, we want to be able to:
Understand the types of data we might encounter, and how to describe them.
Be able to derive some basic "summaries" of our data; a stepping stone to greater things.
Focus on a single question and understand how we can answer it using basic tools.
Plot basic histograms and another Star Wars-themed plot (okay, that's a stretch).
This session will skim over a lot of things quickly. The intention is to not become full-blown programmers or statisticians in one day, but introduce ourselves to certain tools and mental models that will go a long way in helping us explore the potential of the data we might handle on a daily basis.
Simply becoming familiar with the fact that these tools exist and having a glimpse of the things you can do with them is a huge step forward. If stuff goes over your head on the first try, that is okay. This is a process.
This session comes with two sets of materials:
This slide deck right here, which is what you can follow along with. This is me in PDF format.
Whenever you see a 🐧️ emoji on a slide, that means you need to refer to that section of the exercise notebook.
Download your copy here. Store it in the same folder you have created a new project for this session.
Based on our poll, we decided to go with penguins to help us get started with our learning, and so be it! Meet the Palmer Penguins, who have generously consented to being studied by you today. We'll learn more about them shortly.
Artwork by Allison Horst
We need to bring this party down from the Palmer Station, Antarctica before we start.
To do that, open up your notebooks and enter install.packages('palmerpenguins') into the console.
Once it is downloaded, locate the Ex. 1.1 heading and add a new code chunk. You can do this by pressing Ctrl + Alt + I.
In your code block, load the fellas into your environment by typing library(palmerpenguins).
To run your code, either press Ctrl + Enter to run that line, or the green Play icon on the right of the code chunk.
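Putting those steps together, your first chunk might look like this (a minimal sketch; install.packages() runs once in the console, not in the notebook):

# In the console, once: install.packages('palmerpenguins')
library(palmerpenguins) # loads the penguins dataset into your environment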
Sure would be nice to see what data we have here, right? There are two ways to do that.
Pressing Ctrl + Enter on a code chunk line executes it.
penguins # Press Ctrl + Enter inside the code chunk
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
You can also use View(penguins) in your console to open it as a spreadsheet.
Another fun thing to do is see the structure of your dataset.
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
What are the various variable types you can see?
Before we do ANYTHING, we need to include the tidyverse. Which begs the question: what IS the tidyverse?
The tidyverse is an awesome collection of packages designed for data science. It includes packages for data visualization (ggplot2), data manipulation (dplyr), data import (readr), and more.
Wanna see the difference? You don't have to know code to see how cool this is.
Scenario: We've got a bunch of penguins. We need to find all Adelie penguins with bill length greater than 40mm, convert their flipper length to centimeters, and sort them by it. Here is this scenario in two coding flavors:
result <- penguins[penguins$species == "Adelie" & penguins$bill_length_mm > 40, ]
result$flipper_length_cm <- result$flipper_length_mm / 10
result$flipper_length_mm <- NULL  # drop the original mm column
result <- result[order(result$flipper_length_cm, decreasing = TRUE), ]
# It's like assembling IKEA furniture with a spoon.
That's so ugly it is almost Python.
result <- penguins %>%
  filter(species == "Adelie", bill_length_mm > 40) %>%
  mutate(flipper_length_cm = flipper_length_mm / 10) %>%
  select(-flipper_length_mm) %>%
  arrange(desc(flipper_length_cm))
The Tidyverse in R is built around the concept of a "grammar of data manipulation". This grammar is made up of distinct "verbs" that correspond to common data manipulation tasks.
Which is incredibly useful because this language is close to how our cognitive processes work! For example, if I were to spell out my process from the earlier slide:
filter() to keep only rows where species = "Adelie" and bill_length_mm > 40, and then
Add a new column (or mutate() the original dataset), converting flipper length from mm to cm, and then
Exclude the original flipper_length_mm column (select() everything except that), and then
arrange() to sort the data in descending order of flipper_length_cm.
The concept of 'and then' is represented by something called the pipe operator %>%. It seamlessly passes the output of one step as the input to the next, sort of like water flowing through... well, I guess a pipe.
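If it helps, here is a minimal sketch of what the pipe is doing -- the two lines below are equivalent:

# The pipe feeds the value on its left in as the first argument
# of the function on its right.
head(penguins, 2)
penguins %>% head(2)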
Pro tip: When you're in a code chunk, you can just press Ctrl + Shift + M and it will add the pipe! No need to type it manually. Try it.
The verbs we want to worry about today come from the dplyr package, which is part of the tidyverse. While some of the exercises may include more of them, the ones you need to get comfortable with for most of your work are:
select() - Used for selecting, or removing, columns from your dataset.
filter() - Used for filtering rows based on certain values.
mutate() - Used for adding a new column, and then inserting whatever you want into it.
arrange() - Used for sorting your data based on a column, or group of columns.
group_by() - Used for creating groups within your data's categorical variables.
summarize() - Used for creating a summary out of your data, reducing maybe 100 rows to just 2.
You get all of this goodness and more by including the tidyverse package at the beginning of your notebook.
library(tidyverse)
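As a preview of where we're headed, here is one pipeline (my own sketch, not from the exercises) that touches all six verbs:

# Assumes library(tidyverse) and library(palmerpenguins) have been run
penguins %>%
  filter(!is.na(body_mass_g)) %>%               # filter: drop missing weights
  select(species, body_mass_g) %>%              # select: keep two columns
  mutate(body_mass_kg = body_mass_g / 1000) %>% # mutate: add a derived column
  group_by(species) %>%                         # group_by: one group per species
  summarise(mean_kg = mean(body_mass_kg)) %>%   # summarise: one row per group
  arrange(desc(mean_kg))                        # arrange: heaviest species first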
But enough talk, let us explore our data now.
Usually, when you're faced with a new dataset, you want to get a sense of what all those numbers look like. Things like: does a certain category dominate the dataset? How many things are we looking at? What are most of those things (the average) like? And so on.
Here's what I do: look at the raw data, check the structure of the dataset, then start counting and summarising things.
You have already done a little bit of the first two, so let us try counting a few things.
Formulate your question into a list of steps: take the penguins, group them by island, and then count how many are in each group.
penguins %>%
  group_by(island) %>%
  summarise(count = n())
## # A tibble: 3 × 2
##   island    count
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52
Any time you see a function that you don't recognize, enter ? followed by the function to read about it.
?n()
Reading the help window, we now know that n() helps us count the number of items in a group! We could ALSO have done this using count(), another dplyr function:
penguins %>%
  group_by(island) %>%
  count()
## # A tibble: 3 × 2
## # Groups:   island [3]
##   island        n
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52
Would you like to see what is happening here? Go to this link to check out the step-by-step process.
summarize() takes our grouped rows, does the counting, and returns only the necessary data. A summary!
You can use this website to input your code and see how each step looks (as long as it has the palmerpenguins data).
Hint: Try running ?group_by() in the console. Does it allow more than one variable? How can you add another one?
Next slide has the answer.
For extra marks, visualize your code solution on the Tidy Data Visualizer and understand what is happening.
penguins %>%
  group_by(island, sex) %>%
  summarise(count = n())
## # A tibble: 9 × 3
## # Groups:   island [3]
##   island    sex    count
##   <fct>     <fct>  <int>
## 1 Biscoe    female    80
## 2 Biscoe    male      83
## 3 Biscoe    <NA>       5
## 4 Dream     female    61
## 5 Dream     male      62
## 6 Dream     <NA>       1
## 7 Torgersen female    24
## 8 Torgersen male      23
## 9 Torgersen <NA>       5
Oh wait, are those some NAs I see there? That means the value was not recorded and is not available. You want to filter them out before you count anything, right?
The filter() command can help in the following cases:
# Only keep those after 2008
penguins %>%
  filter(year > 2008) %>%
  head(2) # Shows only the first two rows
## # A tibble: 2 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Biscoe           35            17.9               192        3725
## 2 Adelie  Biscoe           41            20                 203        4725
## # ℹ 2 more variables: sex <fct>, year <int>
# Exclude a species
penguins %>%
  filter(species != "Adelie") %>%
  head(2)
filter() cheatsheet (representative examples):

| Use Case | Code Example |
|---|---|
| Basic usage | filter(penguins, species == "Adelie") |
| Multiple conditions (AND) | filter(penguins, species == "Adelie", bill_length_mm > 40) |
| Multiple conditions (OR) | filter(penguins, species == "Adelie" \| island == "Dream") |
| Exclude with NOT | filter(penguins, species != "Adelie") |
| Filtering with %in% | filter(penguins, island %in% c("Biscoe", "Dream")) |
| Between two values | filter(penguins, between(body_mass_g, 3000, 4000)) |
| NA handling | filter(penguins, !is.na(sex)) |
Your code should have looked something like this:
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(island, sex) %>%
  summarise(count = n())
## # A tibble: 6 × 3
## # Groups:   island [3]
##   island    sex    count
##   <fct>     <fct>  <int>
## 1 Biscoe    female    80
## 2 Biscoe    male      83
## 3 Dream     female    61
## 4 Dream     male      62
## 5 Torgersen female    24
## 6 Torgersen male      23
When we analyse data, there are a few important things we try to look for:
Data Distribution: Understanding how the data is spread out. What's the overall shape or pattern of the data?
Central Tendencies: Identifying typical or average values in the data, such as the mean (average), median (middle value), and mode (most frequent).
Data Variation: Examining the extent to which data points differ from each other. How consistent or varied are the values?
With these concepts, we can effectively describe and summarize data, enhancing both analysis and visualization.
To help keep us aligned on our path, we're only going to look at one question.
We need to answer the all-important question of which penguins are the chonkiest, and more importantly, which are RELIABLY chonky.
For the next few demonstrations, we'll choose body_mass_g as our variable of interest.

We're probably aware of what central tendencies are, but a quick refresher. Given a dataset of values like so:
df <- c(13, 15, 105, 2, 25, 30, 35, 40, 10, 15)
The mean is just the average of all these values. No worries about formulas for now, R has a built-in function.
mean(df)
## [1] 29
The median is the middle value of the sorted numbers, separating the lower half from the higher half.
median(df)
## [1] 20
The mode is the most frequently occurring value.
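Oddly enough, base R has no built-in mode function. A common idiom (a sketch using table()) is:

# table() counts each value, which.max() picks the most frequent,
# and names() recovers the value itself (as text, hence as.numeric())
as.numeric(names(which.max(table(df))))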
## [1] 15
Use summarise() to calculate the mean and median of body_mass_g by species and sex for the entire dataset. I've partially done it for you.
penguins %>%
  group_by(species) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
            median_body_mass = )
Hint: You can create a new column by adding a similar line after a comma. Also, what is the na.rm doing?
We want something like this.
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
            median_body_mass = median(body_mass_g, na.rm = T))
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex    mean_body_mass median_body_mass
##   <fct>     <fct>           <dbl>            <dbl>
## 1 Adelie    female          3369.             3400
## 2 Adelie    male            4043.             4000
## 3 Chinstrap female          3527.             3550
## 4 Chinstrap male            3939.             3950
## 5 Gentoo    female          4680.             4700
## 6 Gentoo    male            5485.             5500
The mean by itself often doesn't tell the full story. Have a look at what happens when we plot the real data using a histogram.
A histogram is a graphical representation of the distribution of numerical data. It shows the frequency of values within specific intervals, known as bins. The histogram below would show how many penguins fall into different weight categories.
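The histogram code itself isn't reproduced on this slide, but a minimal ggplot2 sketch (ggplot2 ships with the tidyverse) of this kind of plot would be:

penguins %>%
  filter(!is.na(body_mass_g)) %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200) + # each bar counts penguins in a 200g-wide bin
  facet_wrap(~ species, ncol = 1)  # one panel per species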
Look at the distributions below.

Each species shows a different spread of body mass values. Some are more symmetrical, while others are slightly skewed to the left or right. The mean and median do not capture this aspect of the data.
While the means of the Adelie and Chinstrap penguins are close, their distributions are somewhat different.
Knowing just the mean body mass might lead to the assumption that the distributions are similar. However, the spread of the data can be vastly different.
So what we also want to know is how different this data is around this mean.
This difference is more visible when I split this further by sex within each species.

Clearly, the male and female body weights not only have different central tendencies, but also different ranges. They vary differently. And this is important to describe.
I need to be able to tell you which penguin is most reliably chonky.
A simple way to describe how different the values are is just giving the range of values -- what is the maximum minus the minimum?
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean = mean(body_mass_g, na.rm = T),
            range = max(body_mass_g) - min(body_mass_g)) %>%
  arrange(-range)
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex     mean range
##   <fct>     <fct>  <dbl> <int>
## 1 Chinstrap male   3939.  1550
## 2 Gentoo    male   5485.  1550
## 3 Adelie    male   4043.  1450
## 4 Chinstrap female 3527.  1450
## 5 Gentoo    female 4680.  1250
## 6 Adelie    female 3369.  1050
Hmm, that is a start. This tells me that there's a difference of 1550g between the heaviest and the lightest male Gentoo penguin.
But the range only gives us the extremes. While we could also use quantiles to describe this data, a more common measure is the standard deviation, which tells us how much the body mass of penguins tends to deviate from the mean.
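(As an aside, quantiles are a one-liner in R if you want them; this is not part of the session's main pipeline:)

# The quartiles of body mass: 25%, 50% (the median), and 75%
quantile(penguins$body_mass_g, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)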
SD describes the spread of the data, but also gives a sense of how tightly clustered the data points are around the mean. A lower standard deviation means the data points are more closely packed around the mean, while a higher standard deviation indicates more spread out data.
We're not going to worry about the formula, the sd() function in R is nice enough for now.
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    mean = mean(body_mass_g, na.rm = T),
    sd = sd(body_mass_g, na.rm = T)
  ) %>%
  arrange(-sd)
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex     mean    sd
##   <fct>     <fct>  <dbl> <dbl>
## 1 Chinstrap male   3939.  362.
## 2 Adelie    male   4043.  347.
## 3 Gentoo    male   5485.  313.
## 4 Chinstrap female 3527.  285.
## 5 Gentoo    female 4680.  282.
## 6 Adelie    female 3369.  269.
This is neat! A smaller standard deviation indicates that the penguin weights are clustered closely around the mean, suggesting a more uniform body mass distribution within the species. Conversely, a large standard deviation implies a wide spread of weights.
Not only are males generally heavier than their female counterparts across all species, but the variability in weight is also generally greater among males.
With information about the mean and the standard deviation, you can also understand a lot about the data.
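For instance, here is a quick sanity check -- my own aside, assuming roughly bell-shaped data -- of how much of one group sits within one standard deviation of its mean:

adelie_f <- penguins %>%
  filter(species == "Adelie", sex == "female", !is.na(body_mass_g)) %>%
  pull(body_mass_g) # pull() extracts a single column as a plain vector

m <- mean(adelie_f)
s <- sd(adelie_f)

# Proportion of female Adelie weights within mean ± 1 SD;
# roughly 68% is expected for approximately normal data
mean(adelie_f >= m - s & adelie_f <= m + s)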
Now Aman will try to smoothly transition into a whiteboard as if he's just thought of something and not practiced it a couple of times. Fake.
I told you so.
Remember to plot it 👀
Confidence intervals (CI from hereon) are an extension of standard deviation. While standard deviation gives us a sense of the spread of data, confidence intervals provide a range within which we expect the true population mean to fall. This is typically expressed with a certain level of confidence, often 95%.
I'll create confidence intervals for this data.
## # A tibble: 6 × 6
## # Groups:   species, sex [6]
##   species   sex    estimate conf.low conf.high  p.value
##   <fct>     <fct>     <dbl>    <dbl>     <dbl>    <dbl>
## 1 Adelie    female    3369.    3306.     3432. 4.64e-81
## 2 Adelie    male      4043.    3963.     4124. 7.00e-79
## 3 Chinstrap female    3527.    3428.     3627. 6.96e-38
## 4 Chinstrap male      3939.    3813.     4065. 4.61e-36
## 5 Gentoo    female    4680.    4606.     4754. 1.54e-71
## 6 Gentoo    male      5485.    5405.     5565. 1.41e-76
Imagine a scenario where you go rogue in Antarctica and are hell-bent on measuring every flipping flapping penguin you encounter (god, that would be something).
For all the Adelie female penguins you come across, 95% of them will lie between 3306g and 3432g, and the estimated mean of that population is 3369g.
(My father the epidemiologist would like to point out that this is a crude and unorthodox explanation of CI but says it makes sense for most purposes so well. eh.)
I can generate these confidence intervals in a lot of ways. I can manually calculate them:
confidence_interval <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    sample_mean = mean(body_mass_g),
    sample_sd = sd(body_mass_g),
    n = n(),
    se = sample_sd / sqrt(n),
    ci_lower = sample_mean - qt(0.975, n - 1) * se,
    ci_upper = sample_mean + qt(0.975, n - 1) * se
  )
confidence_interval
## # A tibble: 6 × 8
## # Groups:   species [3]
##   species   sex    sample_mean sample_sd     n    se ci_lower ci_upper
##   <fct>     <fct>        <dbl>     <dbl> <int> <dbl>    <dbl>    <dbl>
## 1 Adelie    female       3369.      269.    73  31.5    3306.    3432.
## 2 Adelie    male         4043.      347.    73  40.6    3963.    4124.
## 3 Chinstrap female       3527.      285.    34  48.9    3428.    3627.
## 4 Chinstrap male         3939.      362.    34  62.1    3813.    4065.
## 5 Gentoo    female       4680.      282.    58  37.0    4606.    4754.
## 6 Gentoo    male         5485.      313.    61  40.1    5405.    5565.
You don't need to understand this code. But FWIW, it's just the formula for calculating a CI using the summarise() function, which we now know. But anyway, this is not ideal.
An easier way is with something called the t.test() function. What a t-test does is beyond the scope of this session, but given a data vector of, say, these values:
df <- c(3, 4, 6, 8, 5, 5, 3, 8, 4, 9, 1)
t.test(df)
It returns the mean of that vector, and a set of 95% confidence interval bounds, which tells us the range between which the true mean is likely to fall for this data.
	One Sample t-test

data:  df
t = 6.8415, df = 10, p-value = 4.505e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.432900 6.748918
sample estimates:
mean of x
 5.090909
Which can actually be made into a neat dataset because of the tidy() function from the broom package.
df <- c(3, 4, 6, 8, 5, 5, 3, 8, 4, 9, 1)
broom::tidy(t.test(df)) %>%
  select(estimate, conf.low, conf.high)
## # A tibble: 1 × 3
##   estimate conf.low conf.high
##      <dbl>    <dbl>     <dbl>
## 1     5.09     3.43      6.75
Allow me to do it for the data on body_mass_g that we are interested in, and also plot it in a tie-fighter plot*.
*Only a Sith deals in absolutes. The rest of us use confidence intervals.
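The plotting half isn't reproduced in this deck. A sketch of it with ggplot2's geom_pointrange(), assuming the tidied table built below has been saved as ci_table (a hypothetical name), would be:

# ci_table: species, sex, estimate, conf.low, conf.high (computed below)
ci_table %>%
  ggplot(aes(x = estimate, y = interaction(species, sex), colour = sex)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) + # dot + CI "wings"
  labs(x = "Body mass (g)", y = NULL)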
This is a section of the code for the previous slide; it is also in your exercise notebook. It uses a variety of data wrangling techniques to make it work within a tidyverse-style pipeline.
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  nest() %>%
  mutate(model = map(data, ~ t.test(.$body_mass_g)),
         tidy_model = map(model, broom::tidy)) %>%
  unnest(tidy_model) %>%
  mutate(sex = fct_reorder(sex, estimate)) %>%
  select(species, sex, estimate, conf.low, conf.high) %>%
  head(4)
## # A tibble: 4 × 5
## # Groups:   species, sex [4]
##   species sex    estimate conf.low conf.high
##   <fct>   <fct>     <dbl>    <dbl>     <dbl>
## 1 Adelie  male      4043.    3963.     4124.
## 2 Adelie  female    3369.    3306.     3432.
## 3 Gentoo  female    4680.    4606.     4754.
## 4 Gentoo  male      5485.    5405.     5565.
Like I have said before, it is not necessary for you to completely understand what is happening here, but the way to approach this is by breaking it down line by line -- execute only a part of the pipeline and see how the data is being transformed.
If you only run the part up to the nest() command, you'll see how I collapsed the data into smaller rows:

Or, you could change the variable going into the t.test() function from body_mass_g to something else; what happens? Group by another variable; use this code as a launchpad to do your own thing.
I also endorse ChatGPT as a learning device. Ask it to explain and break things down; use it in a way that helps you learn, rather than treating this as a chore.
Maybe this will take time to settle in. But you made it!
Here is a recap in a few lines:
We counted the number of penguins and understood how many of each there are.
We became interested in their chonkiness, and calculated their average weights.
We then observed this average in relation to the distribution using histograms.
We reasoned that the mean does not describe the shape and spread of the data by itself, and calculated the ranges.
Looking at the ranges, we realized that this only shows us the extremes of the data, and still did not describe how these values were distributed around the mean.
We used standard deviation to observe the spread of data, and worked backwards from the mean and this new value to create a distribution of the data.
We then calculated the confidence intervals, which allowed us to estimate the range in which the true average weight of the penguin population likely falls, based on our sample data.
Spoilers ahead

bill_length or bill_depth? Hint: It is legal to copy-paste code and change variables!
Hint: Use filter() to create two subsets based on body_mass_g relative to its median, then calculate the means.
Plot flipper_length_mm for each species to visualize the distribution. Hint: Reuse my histogram code if you need to!
There's nothing better than practice. Here are some things I continuously refer to, which are non-imposing and easy to follow and get into:
The TidyTuesday project, where basically everyone who's participating works on the same dataset. And it's fun! On Twitter you can even find this week's things if you search for the hashtag; everyone shares their code as well.
Watch one of the best analyze something live! David Robinson (G.O.A.T.) has these amazing screencasts (annotated here). So many topics, so many datasets. The best mentor you can ask for.
There's also the R4DS Slack, which is helpful if you want to ask questions. They also have book clubs, where everyone works through a book together.
For books, there are Modern Dive and the R4DS book, which are great.
If you're anything like I am, you probably like to do something first and then understand how it works. You can see a small glimpse of what is possible at the r-graph-gallery, tinker around, and then make stuff with your own data.
I literally paraphrased a message to Prakriti for this slide lmao
So long. Thanks for joining.

Crump, M., D. Navarro, and J. Suzuki (2022). Answering Questions with Data: Introductory Statistics for Psychology Students. DOI: 10.17605/OSF.IO/JZE52. (Visited on Dec. 10, 2023).
Wickham, H. and G. Grolemund (2023). R for Data Science. O'Reilly. https://www.oreilly.com/library/view/r-for-data/9781491910382/. (Visited on Dec. 07, 2023).
Ismay, C. and A. Y. Kim (2023). Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. Routledge. https://www.routledge.com/Statistical-Inference-via-Data-Science-A-ModernDive-into-R-and-the-Tidyverse/Ismay-Kim/p/book/9780367409821. (Visited on Dec. 10, 2023).
Baldi, B. (2023). Practice of Statistics in the Life Sciences, 4th Edition. Macmillan Learning. https://store.macmillanlearning.com/ca/product/Practice-of-Statistics-in-the-Life-Sciences-4th-edition/p/1319013376. (Visited on Dec. 10, 2023).
Robinson, D. (2023). Tidy Tuesday Screencast: Analyzing NYC Restaurant Inspections with R. YouTube. https://www.youtube.com/watch?v=em4FXPf4H-Y&t=1620s. (Visited on Dec. 10, 2023).