By the end of this short session, we want to be able to:
Understand the types of data we might encounter, and how to describe them.
Be able to derive some basic "summaries" of our data; a stepping stone to greater things.
Focus on a single question and understand how we can answer it using basic tools.
Plot basic histograms and another Star Wars-themed plot (okay, that's a stretch).
This session will skim over a lot of things quickly. The intention is to not become full-blown programmers or statisticians in one day, but introduce ourselves to certain tools and mental models that will go a long way in helping us explore the potential of the data we might handle on a daily basis.
Simply becoming familiar with the fact that these tools exist and having a glimpse of the things you can do with them is a huge step forward. If stuff goes over your head on the first try, that is okay. This is a process.
This session comes with two sets of materials:
This slide deck right here, which is what you can follow along with. This is me in PDF format.
Whenever you see a 🐧️ emoji on a slide, that means you need to refer to that section of the exercise notebook.
Download your copy here. Store it in the same folder you have created a new project for this session.
Based on our poll, we decided to go with penguins to help us get started with our learning, and so be it! Meet the Palmer Penguins, who have generously consented to being studied by you today. We'll learn more about them shortly.
Artwork by Allison Horst
We need to bring this party down from the Palmer Station, Antarctica before we start.
To do that, open up your notebooks and enter install.packages('palmerpenguins') into the console.
Once it is downloaded, locate the Ex. 1.1 heading and add a new code chunk. You can do this by pressing Ctrl + Alt + I.
In your code block, load the fellas into your environment by typing library(palmerpenguins).
To run your code, either press Ctrl + Enter to run that line, or the green Play icon on the right of the code chunk.
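Putting those steps together, your first chunk might look like this (a minimal sketch; install.packages() runs once in the console, not in the notebook):

# In the console, once: install.packages('palmerpenguins')
library(palmerpenguins) # loads the penguins dataset into your environment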
Sure would be nice to see what data we have here, right? There are two ways to do that.
Pressing Ctrl + Enter on a code chunk line executes it.
penguins # Press Ctrl + Enter inside the code chunk
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
You can also use View(penguins) in your console to open it as a spreadsheet.
Another fun thing to do is see the structure of your dataset.
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
What are the various variable types you can see?
Before we do ANYTHING, we need to include the tidyverse. Which begs the question: what IS the tidyverse?
The tidyverse is an awesome collection of packages designed for data science. It includes packages for data visualization (ggplot2), data manipulation (dplyr), data import (readr), and more.
Wanna see the difference? You don't have to know code to see how cool this is.
Scenario: We've got a bunch of penguins. We need to find all Adelie penguins with bill length greater than 40mm, convert their flipper length to centimeters, and sort them by it. Here is this scenario in two coding flavors:
result <- penguins[penguins$species == "Adelie" & penguins$bill_length_mm > 40, ]
result$flipper_length_cm <- result$flipper_length_mm / 10
result$flipper_length_mm <- NULL  # drop the original mm column
result <- result[order(result$flipper_length_cm, decreasing = TRUE), ]
# It's like assembling IKEA furniture with a spoon.
That's so ugly it is almost Python.
result <- penguins %>%
  filter(species == "Adelie", bill_length_mm > 40) %>%
  mutate(flipper_length_cm = flipper_length_mm / 10) %>%
  select(-flipper_length_mm) %>%
  arrange(desc(flipper_length_cm))
The Tidyverse in R is built around the concept of a "grammar of data manipulation". This grammar is made up of distinct "verbs" that correspond to common data manipulation tasks.
Which is incredibly useful because this language is close to how our cognitive processes work! For example, if I were to spell out my process from the earlier slide:
filter() to keep only rows where species = "Adelie" and bill_length_mm > 40, and then
Add a new column (or mutate() the original dataset), converting flipper length from mm to cm, and then
Exclude the original flipper_length_mm column (select() everything except that), and then
arrange() to sort the data in descending order of flipper_length_cm.
The concept of 'and then' is represented by something called the pipe operator %>%. It seamlessly passes the output of one step as the input to the next, sort of like water flowing through... well, I guess a pipe.
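If it helps, here is a minimal sketch of what the pipe is doing -- the two lines below are equivalent:

# The pipe feeds the value on its left in as the first argument
# of the function on its right.
head(penguins, 2)
penguins %>% head(2)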
Pro tip: When you're in a code chunk, you can just press Ctrl + Shift + M and it will add the pipe! No need to type it manually. Try it.
The verbs we want to worry about today come from the dplyr package, which is part of the tidyverse. While some of the exercises may include more of them, the ones you need to get comfortable with for most of your work are:
select() - Used for selecting, or removing, columns from your dataset.
filter() - Used for filtering rows based on certain values.
mutate() - Used for adding a new column, and then inserting whatever you want into it.
arrange() - Used for sorting your data based on a column, or group of columns.
group_by() - Used for creating groups within your data's categorical variables.
summarize() - Used for creating a summary out of your data, reducing maybe 100 rows to just 2.
You get all of this goodness and more by including the tidyverse package at the beginning of your notebook.
library(tidyverse)
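As a preview of where we're headed, here is one pipeline (my own sketch, not from the exercises) that touches all six verbs:

# Assumes library(tidyverse) and library(palmerpenguins) have been run
penguins %>%
  filter(!is.na(body_mass_g)) %>%               # filter: drop missing weights
  select(species, body_mass_g) %>%              # select: keep two columns
  mutate(body_mass_kg = body_mass_g / 1000) %>% # mutate: add a derived column
  group_by(species) %>%                         # group_by: one group per species
  summarise(mean_kg = mean(body_mass_kg)) %>%   # summarise: one row per group
  arrange(desc(mean_kg))                        # arrange: heaviest species first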
But enough talk, let us explore our data now.
Usually, when you're faced with a new dataset, you want to get a sense of what all those numbers look like. Things like: does a certain category dominate the dataset? How many things are we looking at? What are most of those things (the average) like? And so on.
Here's what I do: look at the raw data, check the structure of the dataset, then start counting and summarising things.
You have already done a little bit of the first two, so let us try counting a few things.
Formulate your question into a list of steps: take the penguins, group them by island, and then count how many are in each group.
penguins %>%
  group_by(island) %>%
  summarise(count = n())
## # A tibble: 3 × 2
##   island    count
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52
Any time you see a function that you don't recognize, enter ? followed by the function to read about it.
?n()
Reading the help window, we now know that n() helps us count the number of items in a group! We could ALSO have done this using count(), another dplyr function:
penguins %>%
  group_by(island) %>%
  count()
## # A tibble: 3 × 2
## # Groups:   island [3]
##   island        n
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52
Would you like to see what is happening here? Go to this link to check out the step-by-step process.
summarize() takes our grouped rows, does the counting, and returns only the necessary data. A summary!
You can use this website to input your code and see how each step looks (as long as it has the palmerpenguins data).
Hint: Try running ?group_by() in the console. Does it allow more than one variable? How can you add another one?
Next slide has the answer.
For extra marks, visualize your code solution on the Tidy Data Visualizer and understand what is happening.
penguins %>%
  group_by(island, sex) %>%
  summarise(count = n())
## # A tibble: 9 × 3
## # Groups:   island [3]
##   island    sex    count
##   <fct>     <fct>  <int>
## 1 Biscoe    female    80
## 2 Biscoe    male      83
## 3 Biscoe    <NA>       5
## 4 Dream     female    61
## 5 Dream     male      62
## 6 Dream     <NA>       1
## 7 Torgersen female    24
## 8 Torgersen male      23
## 9 Torgersen <NA>       5
Oh wait, are those some NAs I see there? That means the value was not recorded and is not available. You want to filter them out before you count anything, right?
The filter() command can help in the following cases:
# Only keep those after 2008
penguins %>%
  filter(year > 2008) %>%
  head(2) # Shows only the first two rows
## # A tibble: 2 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Biscoe           35            17.9               192        3725
## 2 Adelie  Biscoe           41            20                 203        4725
## # ℹ 2 more variables: sex <fct>, year <int>
# Exclude a species
penguins %>%
  filter(species != "Adelie") %>%
  head(2)
filter() cheatsheet (representative examples):

| Use Case | Code Example |
|---|---|
| Basic usage | filter(penguins, species == "Adelie") |
| Multiple conditions (AND) | filter(penguins, species == "Adelie", bill_length_mm > 40) |
| Multiple conditions (OR) | filter(penguins, species == "Adelie" \| island == "Dream") |
| Exclude with NOT | filter(penguins, species != "Adelie") |
| Filtering with %in% | filter(penguins, island %in% c("Biscoe", "Dream")) |
| Between two values | filter(penguins, between(body_mass_g, 3000, 4000)) |
| NA handling | filter(penguins, !is.na(sex)) |
Your code should have looked something like this:
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(island, sex) %>%
  summarise(count = n())
## # A tibble: 6 × 3
## # Groups:   island [3]
##   island    sex    count
##   <fct>     <fct>  <int>
## 1 Biscoe    female    80
## 2 Biscoe    male      83
## 3 Dream     female    61
## 4 Dream     male      62
## 5 Torgersen female    24
## 6 Torgersen male      23
When we analyse data, there are a few important things we try to look for:
Data Distribution: Understanding how the data is spread out. What's the overall shape or pattern of the data?
Central Tendencies: Identifying typical or average values in the data, such as the mean (average), median (middle value), and mode (most frequent).
Data Variation: Examining the extent to which data points differ from each other. How consistent or varied are the values?
With these concepts, we can effectively describe and summarize data, enhancing both analysis and visualization.
To help keep us aligned on our path, we're only going to look at one question.
We need to answer the all-important question of which penguins are the chonkiest, and more importantly, which are RELIABLY chonky.
For the next few demonstrations, we'll choose body_mass_g as our variable of interest.

We're probably aware of what central tendencies are, but a quick refresher. Given a dataset of values like so:
df <- c(13, 15, 105, 2, 25, 30, 35, 40, 10, 15)
The mean is just the average of all these values. No worries about formulas for now, R has a built-in function.
mean(df)
## [1] 29
The median is the middle value of the sorted numbers, separating the lower half from the higher half.
median(df)
## [1] 20
The mode is the most frequently occurring value.
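Oddly enough, base R has no built-in mode function. A common idiom (a sketch using table()) is:

# table() counts each value, which.max() picks the most frequent,
# and names() recovers the value itself (as text, hence as.numeric())
as.numeric(names(which.max(table(df))))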
## [1] 15
Use summarise() to calculate the mean and median of body_mass_g by species and sex for the entire dataset. I've partially done it for you.
penguins %>%
  group_by(species) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
            median_body_mass = )
Hint: You can create a new column by adding a similar line after a comma. Also, what is the na.rm doing?
We want something like this.
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
            median_body_mass = median(body_mass_g, na.rm = T))
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex    mean_body_mass median_body_mass
##   <fct>     <fct>           <dbl>            <dbl>
## 1 Adelie    female          3369.             3400
## 2 Adelie    male            4043.             4000
## 3 Chinstrap female          3527.             3550
## 4 Chinstrap male            3939.             3950
## 5 Gentoo    female          4680.             4700
## 6 Gentoo    male            5485.             5500
The mean by itself often doesn't tell the full story. Have a look at what happens when we plot the real data using a histogram.
A histogram is a graphical representation of the distribution of numerical data. It shows the frequency of values within specific intervals, known as bins. The histogram below would show how many penguins fall into different weight categories.
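The histogram code itself isn't reproduced on this slide, but a minimal ggplot2 sketch (ggplot2 ships with the tidyverse) of this kind of plot would be:

penguins %>%
  filter(!is.na(body_mass_g)) %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200) + # each bar counts penguins in a 200g-wide bin
  facet_wrap(~ species, ncol = 1)  # one panel per species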
Look at the distributions below.

Each species shows a different spread of body mass values. Some are more symmetrical, while others are slightly skewed to the left or right. The mean and median do not capture this aspect of the data.
While the means of the Adelie and Chinstrap penguins are close, their distributions are somewhat different.
Knowing just the mean body mass might lead to the assumption that the distributions are similar. However, the spread of the data can be vastly different.
So what we also want to know is how different this data is around this mean.
This difference is more visible when I split this further by sex within each species.

Clearly, the male and female body weights not only have different central tendencies, but also different ranges. They vary differently. And this is important to describe.
I need to be able to tell you which penguin is most reliably chonky.
A simple way to describe how different the values are is just giving the range of values -- what is the maximum minus the minimum?
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean = mean(body_mass_g, na.rm = T),
            range = max(body_mass_g) - min(body_mass_g)) %>%
  arrange(-range)
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex     mean range
##   <fct>     <fct>  <dbl> <int>
## 1 Chinstrap male   3939.  1550
## 2 Gentoo    male   5485.  1550
## 3 Adelie    male   4043.  1450
## 4 Chinstrap female 3527.  1450
## 5 Gentoo    female 4680.  1250
## 6 Adelie    female 3369.  1050
Hmm, that is a start. This tells me that there's a difference of 1550g between the heaviest and the lightest male Gentoo penguin.
But the range only gives us the extremes. While we could also use quantiles to describe this data, a more common measure is the standard deviation, which tells us how much the body mass of penguins tends to deviate from the mean.
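(As an aside, quantiles are a one-liner in R if you want them; this is not part of the session's main pipeline:)

# The quartiles of body mass: 25%, 50% (the median), and 75%
quantile(penguins$body_mass_g, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)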
SD describes the spread of the data, but also gives a sense of how tightly clustered the data points are around the mean. A lower standard deviation means the data points are more closely packed around the mean, while a higher standard deviation indicates more spread out data.
We're not going to worry about the formula, the sd() function in R is nice enough for now.
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    mean = mean(body_mass_g, na.rm = T),
    sd = sd(body_mass_g, na.rm = T)
  ) %>%
  arrange(-sd)
## # A tibble: 6 × 4
## # Groups:   species [3]
##   species   sex     mean    sd
##   <fct>     <fct>  <dbl> <dbl>
## 1 Chinstrap male   3939.  362.
## 2 Adelie    male   4043.  347.
## 3 Gentoo    male   5485.  313.
## 4 Chinstrap female 3527.  285.
## 5 Gentoo    female 4680.  282.
## 6 Adelie    female 3369.  269.
This is neat! A smaller standard deviation indicates that the penguin weights are clustered closely around the mean, suggesting a more uniform body mass distribution within the species. Conversely, a large standard deviation implies a wide spread of weights.
Not only are males generally heavier than their female counterparts across all species, but the variability in weight is also generally greater among males.
With information about the mean and the standard deviation, you can also understand a lot about the data.
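For instance, here is a quick sanity check -- my own aside, assuming roughly bell-shaped data -- of how much of one group sits within one standard deviation of its mean:

adelie_f <- penguins %>%
  filter(species == "Adelie", sex == "female", !is.na(body_mass_g)) %>%
  pull(body_mass_g) # pull() extracts a single column as a plain vector

m <- mean(adelie_f)
s <- sd(adelie_f)

# Proportion of female Adelie weights within mean ± 1 SD;
# roughly 68% is expected for approximately normal data
mean(adelie_f >= m - s & adelie_f <= m + s)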
Now Aman will try to smoothly transition into a whiteboard as if he's just thought of something and not practiced it a couple of times. Fake.
I told you so.
Remember to plot it 👀
Confidence intervals (CI from hereon) are an extension of standard deviation. While standard deviation gives us a sense of the spread of data, confidence intervals provide a range within which we expect the true population mean to fall. This is typically expressed with a certain level of confidence, often 95%.
I'll create confidence intervals for this data.
## # A tibble: 6 × 6
## # Groups:   species, sex [6]
##   species   sex    estimate conf.low conf.high  p.value
##   <fct>     <fct>     <dbl>    <dbl>     <dbl>    <dbl>
## 1 Adelie    female    3369.    3306.     3432. 4.64e-81
## 2 Adelie    male      4043.    3963.     4124. 7.00e-79
## 3 Chinstrap female    3527.    3428.     3627. 6.96e-38
## 4 Chinstrap male      3939.    3813.     4065. 4.61e-36
## 5 Gentoo    female    4680.    4606.     4754. 1.54e-71
## 6 Gentoo    male      5485.    5405.     5565. 1.41e-76
Imagine a scenario where you go rogue in Antarctica and are hell-bent on measuring every flipping flapping penguin you encounter (god, that would be something).
For all the Adelie female penguins you come across, 95% of them will lie between 3306g and 3432g, and the estimated mean of that population is 3369g.
(My father the epidemiologist would like to point out that this is a crude and unorthodox explanation of CI but says it makes sense for most purposes so well. eh.)
I can generate these confidence intervals in a lot of ways. I can manually calculate them:
confidence_interval <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    sample_mean = mean(body_mass_g),
    sample_sd = sd(body_mass_g),
    n = n(),
    se = sample_sd / sqrt(n),
    ci_lower = sample_mean - qt(0.975, n - 1) * se,
    ci_upper = sample_mean + qt(0.975, n - 1) * se
  )
confidence_interval
## # A tibble: 6 × 8
## # Groups:   species [3]
##   species   sex    sample_mean sample_sd     n    se ci_lower ci_upper
##   <fct>     <fct>        <dbl>     <dbl> <int> <dbl>    <dbl>    <dbl>
## 1 Adelie    female       3369.      269.    73  31.5    3306.    3432.
## 2 Adelie    male         4043.      347.    73  40.6    3963.    4124.
## 3 Chinstrap female       3527.      285.    34  48.9    3428.    3627.
## 4 Chinstrap male         3939.      362.    34  62.1    3813.    4065.
## 5 Gentoo    female       4680.      282.    58  37.0    4606.    4754.
## 6 Gentoo    male         5485.      313.    61  40.1    5405.    5565.
You don't need to understand this code. But FWIW, it's just the formula for calculating a CI using the summarise() function, which we now know. But anyway, this is not ideal.
An easier way is with something called the t.test() function. What a t-test does is beyond the scope of this session, but given a data vector of, say, these values:
df <- c(3, 4, 6, 8, 5, 5, 3, 8, 4, 9, 1)
t.test(df)
It returns the mean of that vector, and a set of 95% confidence interval bounds, which tells us the range between which the true mean is likely to fall for this data.
	One Sample t-test

data:  df
t = 6.8415, df = 10, p-value = 4.505e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.432900 6.748918
sample estimates:
mean of x
 5.090909
Which can actually be made into a neat dataset because of the tidy() function from the broom package.
df <- c(3, 4, 6, 8, 5, 5, 3, 8, 4, 9, 1)
broom::tidy(t.test(df)) %>%
  select(estimate, conf.low, conf.high)
## # A tibble: 1 × 3
##   estimate conf.low conf.high
##      <dbl>    <dbl>     <dbl>
## 1     5.09     3.43      6.75
Allow me to do it for the data on body_mass_g that we are interested in, and also plot it in a tie-fighter plot*.
*Only a Sith deals in absolutes. The rest of us use confidence intervals.
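The plotting half isn't reproduced in this deck. A sketch of it with ggplot2's geom_pointrange(), assuming the tidied table built below has been saved as ci_table (a hypothetical name), would be:

# ci_table: species, sex, estimate, conf.low, conf.high (computed below)
ci_table %>%
  ggplot(aes(x = estimate, y = interaction(species, sex), colour = sex)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) + # dot + CI "wings"
  labs(x = "Body mass (g)", y = NULL)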
This is a section of the code for the previous slide; it is also in your exercise notebook. It uses a variety of data wrangling techniques to make it work within a tidyverse-style pipeline.
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  nest() %>%
  mutate(model = map(data, ~ t.test(.$body_mass_g)),
         tidy_model = map(model, broom::tidy)) %>%
  unnest(tidy_model) %>%
  mutate(sex = fct_reorder(sex, estimate)) %>%
  select(species, sex, estimate, conf.low, conf.high) %>%
  head(4)
## # A tibble: 4 × 5
## # Groups:   species, sex [4]
##   species sex    estimate conf.low conf.high
##   <fct>   <fct>     <dbl>    <dbl>     <dbl>
## 1 Adelie  male      4043.    3963.     4124.
## 2 Adelie  female    3369.    3306.     3432.
## 3 Gentoo  female    4680.    4606.     4754.
## 4 Gentoo  male      5485.    5405.     5565.
Like I have said before, it is not necessary for you to completely understand what is happening here, but the way to approach this is by breaking it down line by line -- execute only a part of the pipeline and see how the data is being transformed.
If you only run the part up to the nest() command, you'll see how I collapsed the data into smaller rows:

Or, you could change the variable going into the t.test() function from body_mass_g to something else; what happens? Group by another variable; use this code as a launchpad to do your own thing.
I also endorse ChatGPT as a learning device. Ask it to explain and break things down; use it in a way that helps you learn, rather than treating this as a chore.
Maybe this will take time to settle in. But you made it!
Here is a recap in a few lines:
We counted the number of penguins and understood how many of each there are.
We became interested in their chonkiness, and calculated their average weights.
We then observed this average in relation to the distribution using histograms.
We reasoned that the mean does not describe the shape and spread of the data by itself, and calculated the ranges.
Looking at the ranges, we realized that this only shows us the extremes of the data, and still did not describe how these values were distributed around the mean.
We used standard deviation to observe the spread of data, and worked backwards from the mean and this new value to create a distribution of the data.
We then calculated the confidence intervals, which allowed us to estimate the range in which the true average weight of the penguin population likely falls, based on our sample data.
Spoilers ahead

bill_length or bill_depth? Hint: It is legal to copy-paste code and change variables!
Hint: Use filter() to create two subsets based on body_mass_g relative to its median, then calculate the means.
Plot flipper_length_mm for each species to visualize the distribution. Hint: Reuse my histogram code if you need to!
There's nothing better than practice. Here are some things I continuously refer to, which are non-imposing and easy to follow and get into:
The TidyTuesday project, where basically everyone who's participating works on the same dataset. And it's fun! On Twitter you can even find this week's things if you search for the hashtag; everyone shares their code as well.
Watch one of the best analyze something live! David Robinson (G.O.A.T.) has these amazing screencasts (annotated here). So many topics, so many datasets. The best mentor you can ask for.
There's also the R4DS Slack, which is helpful if you want to ask questions. They also have book clubs, where everyone works through a book together.
For books, there are Modern Dive and the R4DS book, which are great.
If you're anything like I am, you probably like to do something first and then understand how it works. You can see a small glimpse of what is possible at the r-graph-gallery, tinker around, and then make stuff with your own data.
I literally paraphrased a message to Prakriti for this slide lmao
So long. Thanks for joining.

Crump, M., D. Navarro, and J. Suzuki (2022). Answering Questions with Data: Introductory Statistics for Psychology Students. DOI: 10.17605/OSF.IO/JZE52. (Visited on Dec. 10, 2023).
Wickham, H. and G. Grolemund (2023). R for Data Science. O'Reilly. https://www.oreilly.com/library/view/r-for-data/9781491910382/. (Visited on Dec. 07, 2023).
Ismay, C. and A. Y. Kim (2023). Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. Routledge. https://www.routledge.com/Statistical-Inference-via-Data-Science-A-ModernDive-into-R-and-the-Tidyverse/Ismay-Kim/p/book/9780367409821. (Visited on Dec. 10, 2023).
Baldi, B. (2023). Practice of Statistics in the Life Sciences, 4th Edition. Macmillan Learning. https://store.macmillanlearning.com/ca/product/Practice-of-Statistics-in-the-Life-Sciences-4th-edition/p/1319013376. (Visited on Dec. 10, 2023).
Robinson, D. (2023). Tidy Tuesday Screencast: Analyzing NYC Restaurant Inspections with R. YouTube. https://www.youtube.com/watch?v=em4FXPf4H-Y&t=1620s. (Visited on Dec. 10, 2023).