
Getting started with R

with

your friendly neighbourhood A man

1 / 45

So what are we going to do?

By the end of this short session, we want to be able to:

  1. Understand the types of data we might encounter, and how to describe them.

  2. Be able to derive some basic "summaries" of our data; a stepping stone to greater things.

  3. Focus on a single question and understand how we can answer it using basic tools.

  4. Plot basic histograms and another Star Wars-themed plot (okay, that's a stretch).

This session will skim over a lot of things quickly. The intention is not to become full-blown programmers or statisticians in one day, but to introduce ourselves to certain tools and mental models that will go a long way in helping us explore the potential of the data we might handle on a daily basis.

Simply becoming familiar with the fact that these tools exist and having a glimpse of the things you can do with them is a huge step forward. If stuff goes over your head on the first try, that is okay. This is a process.

2 / 45

Getting around

This session comes with two sets of materials:

  1. This slide deck right here, which is what you can follow along with. This is me in PDF format.

  2. Whenever you see a 🐧️ emoji on a slide, that means you need to refer to that section of the exercise notebook.

Download your copy here. Store it in the same folder where you have created a new project for this session.

3 / 45

Meet the team

Based on our poll, we decided to go with penguins to help us get started with our learning, and so be it! Meet the Palmer Penguins, who have generously consented to being studied by you today. We'll learn more about them shortly.

Palmer Penguins Artwork by Allison Horst

4 / 45

🐧 Ex. 1.1 Download your co-workers

We need to bring this party down from the Palmer Station, Antarctica before we start.

To do that, open up your notebooks and enter install.packages('palmerpenguins') into the console.

Once it is downloaded, locate the Ex. 1.1 heading and add a new code block. You can do this by pressing Ctrl + Alt + I.

In your code block, load the fellas into your environment by typing library(palmerpenguins).

To run your code, either press Ctrl + Enter to run that line, or the green Play icon on the right of the code chunk.

5 / 45

Well, well, well, waddle we got here?

Sure would be nice to see what data we have here, right? There are two ways to do that.

🐧 Ex. 1.2 Inspect your data

Pressing Ctrl + Enter on a code chunk line executes it.

penguins # Press Ctrl + Enter inside the code chunk
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

You can also use View(penguins) in your console to open it as a spreadsheet.

6 / 45

Another fun thing to do is see the structure of your dataset.

str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

What are the various variable types you can see?

7 / 45

Before we do ANYTHING, we need to load the tidyverse. Which begs the question:

The heck is tidyverse anyway?

Tidyverse is an awesome collection of packages designed for data science. It includes packages for:

  1. Data manipulation (dplyr, tidyr, forcats)
  2. Data visualization (ggplot2)
  3. Data import (readr)
  4. Strings and dates (stringr, lubridate)

The R community loves their hex stickers

Wanna see the difference? You don't have to know code to see how cool this is.

8 / 45

Let's compare

Scenario: We've got a bunch of penguins. We need to find all Adelie penguins with bills longer than 40mm, convert their flipper lengths to centimeters, and sort them by that length. Here is this scenario in two coding flavors:

In the time before God:

result <- penguins[penguins$species == "Adelie" & penguins$bill_length_mm > 40, ]
result$flipper_length_cm <- result$flipper_length_mm / 10
result$flipper_length_mm <- NULL # drop the original mm column
result <- result[order(result$flipper_length_cm, decreasing = TRUE), ]
# It's like assembling IKEA furniture with a spoon.

That's so ugly it is almost Python.

But the then God said, let there be light:

result <- penguins %>%
  filter(species == "Adelie", bill_length_mm > 40) %>%
  mutate(flipper_length_cm = flipper_length_mm / 10) %>%
  select(-flipper_length_mm) %>%
  arrange(desc(flipper_length_cm))
9 / 45

Tidyverse has a grammar

The Tidyverse in R is built around the concept of a "grammar of data manipulation". This grammar is made up of distinct "verbs" that correspond to common data manipulation tasks.

Which is incredibly useful because this language is close to how our cognitive processes work! For example, if I were to spell out my process from the earlier slide:

  1. Start with the dataset and then
  1. filter() to keep only rows where species = "Adelie" and bill_length_mm > 40 and then

  2. Add a new column (or mutate() the original dataset), converting flipper length from mm to cm and then

  3. Exclude the original flipper_length_mm column ( select() everything except that) and then

  4. arrange() to sort the data in descending order of flipper_length_cm.

10 / 45

Tidyverse has a grammar

The Tidyverse in R is built around the concept of a "grammar of data manipulation". This grammar is made up of distinct "verbs" that correspond to common data manipulation tasks.

Which is incredibly useful because this language is close to how our cognitive processes work! For example, if I were to spell out my process from the earlier slide:

  1. Start with the dataset and then

  2. filter() to keep only rows where species = "Adelie" and bill_length_mm > 40 and then

  3. Add a new column (or mutate() the original dataset), converting flipper length from mm to cm and then

  4. Exclude the original flipper_length_mm column ( select() everything except that) and then

  5. arrange() to sort the data in descending order of flipper_length_cm.

The concept of 'and then' is represented by something called the pipe operator %>%. It seamlessly passes the output of one step as the input to the next, sort of like water flowing through... well, I guess, a pipe.

Pro tip: When you're in a code chunk, you can just press Ctrl + Shift + M and it will add the pipe! No need to type it manually. Try it.

10 / 45
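To see that the pipe really is just 'and then', here is a minimal sketch of a piped chain next to its nested equivalent, using plain base functions (no penguins needed):

```r
# Nested calls have to be read inside-out...
round(sqrt(c(2, 8, 18)), 1)

# ...while the pipe reads left to right, one step at a time.
# %>% comes along when you load the tidyverse.
library(magrittr)
c(2, 8, 18) %>%
  sqrt() %>%
  round(1)
```

Both lines produce exactly the same result; the pipe only changes how the steps read.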

Our verbs for today

The verbs we want to worry about today come from the dplyr package, which is part of the tidyverse. While some of the exercises may include more of them, the ones you need to get comfortable with for most of your work are:

  1. select() - Used for selecting, or removing, columns from your dataset.
  2. filter() - Used for filtering rows based on certain values.
  3. mutate() - Used for adding a new column, and then inserting whatever you want into it.
  4. arrange() - Used for sorting your data based on a column, or group of columns.
  5. group_by() - Used for creating groups within your data's categorical variables.
  6. summarize() - Used for creating a summary out of your data, reducing maybe 100 rows to just 2.

You get all of this goodness and more by loading the tidyverse package at the beginning of your notebook.

library(tidyverse)

But enough talk, let us explore our data now.

11 / 45

Pre-flight checklist.

Usually, when you're faced with a new dataset, you want to get a sense of what all those numbers look like. Things like: does a certain category dominate the dataset? How many things are we looking at? What are most of those things (the average) like? And so on.

Here's what I do:

  1. Understand the types of variables.
  2. Understand the size of the dataset.
  3. Count whatever I can count.
  4. Figure out what the typical values look like.
  5. Plot some data.

You have already done a little bit of the first two, so let's try counting a few things.

We will do this by asking a series of simple questions and then trying to answer them with code.

12 / 45

How many penguins on each island?

Formulate your question into a list of steps:

  1. First, take the data and then
  2. Group it by island
  3. Summarize this by counting the number of penguins in each group
penguins %>%
  group_by(island) %>%
  summarise(count = n())
## # A tibble: 3 × 2
## island count
## <fct> <int>
## 1 Biscoe 168
## 2 Dream 124
## 3 Torgersen 52
13 / 45

TIMEOUT! What is n()?

Any time you see a function that you don't recognize, enter ? followed by the function to read about it.

?n()

Reading the help window, we know that n() helps us count the number of items in a group! We could ALSO have done this using count(), another dplyr function.

penguins %>%
  group_by(island) %>%
  count()
## # A tibble: 3 × 2
## # Groups: island [3]
## island n
## <fct> <int>
## 1 Biscoe 168
## 2 Dream 124
## 3 Torgersen 52
14 / 45

Would you like to see what is happening here? Go to this link to check out the step-by-step process.

summarize() takes our grouped rows, does the counting and returns only the necessary data. A summary!

Summarizing

You can use this website to input your code and see what each step looks like (as long as it uses the palmerpenguins data).

15 / 45

🐧 Ex. 2.1 Your turn, how many male and female penguins on each island?

Hint: Try running ?group_by() in the console. Does it allow more than one variable? How can you add another one?

Next slide has the answer.

For extra marks, visualize your code solution on the Tidy Data Visualizer and understand what is happening.

16 / 45
penguins %>%
  group_by(island, sex) %>%
  summarise(count = n())
## # A tibble: 9 × 3
## # Groups: island [3]
## island sex count
## <fct> <fct> <int>
## 1 Biscoe female 80
## 2 Biscoe male 83
## 3 Biscoe <NA> 5
## 4 Dream female 61
## 5 Dream male 62
## 6 Dream <NA> 1
## 7 Torgersen female 24
## 8 Torgersen male 23
## 9 Torgersen <NA> 5
17 / 45

Oh wait, are those some NAs I see there? That means that value was not recorded and is not available. You want to filter them out before you count anything, right?

The filter() command can help in the following cases:

  • Keep only those rows which meet a certain criteria
# Only keep those after 2008
penguins %>%
  filter(year > 2008) %>%
  head(2) # Shows only the first two rows
## # A tibble: 2 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 35 17.9 192 3725
## 2 Adelie Biscoe 41 20 203 4725
## # ℹ 2 more variables: sex <fct>, year <int>
  • Keep everything EXCEPT a certain value
penguins %>%
  filter(species != "Adelie") %>%
  head(2)
18 / 45
filter() cheatsheet

  • Basic usage: penguins %>% filter(species == 'Adelie')

  • Multiple conditions (AND): penguins %>% filter(species == 'Adelie' & island == 'Torgersen')

  • Multiple conditions (OR): penguins %>% filter(species == 'Adelie' | species == 'Chinstrap')

  • Exclude with NOT: penguins %>% filter(!(species == 'Adelie'))

  • Filtering with %in%: penguins %>% filter(island %in% c('Dream', 'Torgersen'))

  • Between two values: penguins %>% filter(bill_length_mm >= 40, bill_length_mm <= 50)

  • NA handling: penguins %>% filter(is.na(bill_depth_mm)) or penguins %>% filter(!is.na(bill_depth_mm))

🐧 Ex. 2.2 Can you remove NAs before you count penguins by gender?

19 / 45

Your code should have looked something like this:

penguins %>%
  filter(!is.na(sex)) %>%
  group_by(island, sex) %>%
  summarise(count = n())
## # A tibble: 6 × 3
## # Groups: island [3]
## island sex count
## <fct> <fct> <int>
## 1 Biscoe female 80
## 2 Biscoe male 83
## 3 Dream female 61
## 4 Dream male 62
## 5 Torgersen female 24
## 6 Torgersen male 23
20 / 45

Counting's cool, but there's more!

When we analyse data, there are a few important things we try to look for:

  1. Data Distribution: Understanding how data is spread out. What's the overall shape or pattern of the data?

  2. Central Tendencies: Identifying typical or average values in the data, such as the mean (average), median (middle value), and mode (most frequent).

  3. Data Variation: Examining the extent to which data points differ from each other. How consistent or varied are the values?

With these concepts, we can effectively describe and summarize data, enhancing both analysis and visualization.

To help keep us aligned on our path, we're only going to look at one question.

21 / 45

Quantifying chonkiness

We need to answer the all important question of which penguins are the chonkiest, and more importantly, which are RELIABLY chonky.

For the next few demonstrations, we'll choose body_mass_g as our variable of interest.

chonking

22 / 45

Back to school

We're probably aware of what central tendencies are, but a quick refresher. Given a dataset of values like so:

df <- c(13, 15, 105, 2, 25, 30, 35, 40, 10, 15)

Mean

The mean is just the average of all these values. No worries about formulas for now, R has a built in function.

mean(df)
## [1] 29

Median

The median is the middle value of the sorted numbers, separating the lower half from the higher half.

median(df)
## [1] 20

Mode

The mode is the most frequently occurring value. Base R has no built-in function to compute it, but for our data the mode is:

## [1] 15
23 / 45
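Since base R has no built-in statistical mode (the mode() function reports a value's storage type instead), here is a small sketch of a helper; the name stat_mode is our own invention:

```r
# Most frequent value: count occurrences of each unique value,
# then return the one with the highest count.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df <- c(13, 15, 105, 2, 25, 30, 35, 40, 10, 15)
stat_mode(df)
## [1] 15
```

Note that which.max() returns the first maximum, so if two values tie, this helper reports whichever appears first.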

🐧 Ex. 2.3 Use summarise() to calculate the mean and median of the body_mass_g by species and gender for the entire dataset.

I've partially done it for you.

penguins %>%
  group_by(species) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
            median_body_mass = )

Hint: You can create a new column by adding a similar line after a comma. Also, what is the na.rm doing?

24 / 45
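As for the hint about na.rm, a quick sketch with toy values (not from the dataset) shows why it matters:

```r
# Any NA poisons arithmetic: the result is NA unless we drop them first.
x <- c(3750, 3800, NA, 3250)

mean(x)               # NA
mean(x, na.rm = TRUE) # 3600 -- na.rm ("NA remove") drops missing values first
```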

We want something like this.

penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T),
            median_body_mass = median(body_mass_g, na.rm = T))
## # A tibble: 6 × 4
## # Groups: species [3]
## species sex mean_body_mass median_body_mass
## <fct> <fct> <dbl> <dbl>
## 1 Adelie female 3369. 3400
## 2 Adelie male 4043. 4000
## 3 Chinstrap female 3527. 3550
## 4 Chinstrap male 3939. 3950
## 5 Gentoo female 4680. 4700
## 6 Gentoo male 5485. 5500
25 / 45

The mean by itself often doesn't tell the full story. Have a look at what happens when we plot the real data using a histogram.

A histogram is a graphical representation of the distribution of numerical data. It shows the frequency of values within specific intervals, known as bins. The histogram below would show how many penguins fall into different weight categories.

Look at the distributions below.

26 / 45

Observations from our histograms:

  1. Each species shows a different spread of body mass values. Some are more symmetrical, while others are slightly skewed to the left or right. Mean and median do not capture this aspect of the data.

  2. While the means of the Adelie and Chinstrap penguins are close, their distributions are somewhat different.

  3. Knowing just the mean body mass might lead to the assumption that the distributions are similar. However, the spread of the data can be vastly different.

So what we also want to know is how different this data is around this mean.

27 / 45

This difference is more visible when I split this further by species.

Clearly, the male and female body weights not only have different central tendencies, but also different ranges. They vary differently. And this is important to describe.

I need to be able to tell you which penguin is most reliably chonky.

28 / 45

Describing differentness

A simple way to describe how different the values are is just giving the range of values -- what is the maximum minus the minimum?

penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean = mean(body_mass_g, na.rm = T),
            range = max(body_mass_g) - min(body_mass_g)) %>%
  arrange(-range)
## # A tibble: 6 × 4
## # Groups: species [3]
## species sex mean range
## <fct> <fct> <dbl> <int>
## 1 Chinstrap male 3939. 1550
## 2 Gentoo male 5485. 1550
## 3 Adelie male 4043. 1450
## 4 Chinstrap female 3527. 1450
## 5 Gentoo female 4680. 1250
## 6 Adelie female 3369. 1050

Hmm, that is a start. This tells me that there's a difference of 1550g between the heaviest and the lightest male Gentoo penguin.

But the range only gives us the extremes. While we could also use quantiles to describe this data, a more common measure is the standard deviation, which tells us how much the body mass of penguins tends to deviate from the mean.

29 / 45
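Since quantiles got a mention, here is a minimal sketch of quantile() on the toy vector from earlier; the interquartile range (IQR) describes spread using the middle 50% of the data rather than the two extremes:

```r
x <- c(13, 15, 105, 2, 25, 30, 35, 40, 10, 15)

# 25th, 50th (the median) and 75th percentiles
quantile(x, c(0.25, 0.5, 0.75))

# IQR = 75th percentile minus 25th. The single outlier (105) barely
# moves it, unlike the range (max - min), which it dominates.
IQR(x)
max(x) - min(x)
```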

Measuring spread

The standard deviation (SD) describes the spread of the data and gives a sense of how tightly clustered the data points are around the mean. A lower standard deviation means the data points are more closely packed around the mean, while a higher standard deviation indicates more spread-out data.

We're not going to worry about the formula, the sd() function in R is nice enough for now.
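A tiny made-up example (not penguin data) of what sd() reports -- two vectors with the same mean but very different spread:

```r
# Two made-up samples, both with mean 5
tight  <- c(4, 5, 5, 6)    # values hug the mean
spread <- c(1, 3, 7, 9)    # values are scattered

mean(tight)   # 5
mean(spread)  # 5
sd(tight)     # ~0.82
sd(spread)    # ~3.65
```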

30 / 45
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    mean = mean(body_mass_g, na.rm = TRUE),
    sd = sd(body_mass_g, na.rm = TRUE)
  ) %>%
  arrange(-sd)
## # A tibble: 6 × 4
## # Groups: species [3]
## species sex mean sd
## <fct> <fct> <dbl> <dbl>
## 1 Chinstrap male 3939. 362.
## 2 Adelie male 4043. 347.
## 3 Gentoo male 5485. 313.
## 4 Chinstrap female 3527. 285.
## 5 Gentoo female 4680. 282.
## 6 Adelie female 3369. 269.

This is neat! A smaller standard deviation indicates that the penguin weights are clustered closely around the mean, suggesting a more uniform body mass distribution within the species. Conversely, a large standard deviation implies a wide spread of weights.

Not only are males generally heavier than their female counterparts across all species, but the variability in weight is also generally greater among males.

With just the mean and the standard deviation, you can already understand a lot about the data.
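For instance, for roughly bell-shaped data, a common rule of thumb says that about 95% of values fall within two standard deviations of the mean. A small sketch of working backwards from just those two numbers, using the male Gentoo figures from the table above (and assuming their weights are roughly normal):

```r
m <- 5485  # mean body mass (g) for male Gentoos, from the table above
s <- 313   # their standard deviation (g)

# If weights are roughly normal, ~95% of male Gentoos should weigh
# somewhere in this interval:
c(lower = m - 2 * s, upper = m + 2 * s)
#> lower upper
#>  4859  6111
```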

31 / 45


Now Aman will try to smoothly transition into a whiteboard as if he's just thought of something and not practiced it a couple of times. Fake.

I told you so.

32 / 45

🐧 Ex. 2.4 Your exercise notebook has a mystery dataset. What do you notice about the mean of the X and Y column? Is this differentness or similarity reflected in their distributions?

Remember to plot it 👀

33 / 45

Confidence intervals

Confidence intervals (CI from here on) build on the standard deviation. While the standard deviation gives us a sense of the spread of the data, a confidence interval gives us a range within which we expect the true population mean to fall, stated with a certain level of confidence -- typically 95%.

I'll create confidence intervals for this data.

## # A tibble: 6 × 6
## # Groups: species, sex [6]
## species sex estimate conf.low conf.high p.value
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie female 3369. 3306. 3432. 4.64e-81
## 2 Adelie male 4043. 3963. 4124. 7.00e-79
## 3 Chinstrap female 3527. 3428. 3627. 6.96e-38
## 4 Chinstrap male 3939. 3813. 4065. 4.61e-36
## 5 Gentoo female 4680. 4606. 4754. 1.54e-71
## 6 Gentoo male 5485. 5405. 5565. 1.41e-76

What this ugly table is trying to say is this:

Imagine a scenario where you go rogue in Antarctica and are hell-bent on measuring every flipping flapping penguin you encounter (god that would be something).

For all the Adelie female penguins out there, the true average weight most likely falls between 3306g and 3432g, and our best single estimate of that average, from our sample, is 3369g.

(My father the epidemiologist would like to point out that this is a crude and unorthodox explanation of CI, but says it makes sense for most purposes, so well. eh.)
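If you want the textbook reading of "95% confidence", it's about repetition: if we redid the whole sampling exercise many times, about 95% of the intervals we compute would contain the true mean. A quick simulation sketch with a made-up population (not the real penguins):

```r
set.seed(42)
true_mean <- 3400  # pretend this is the true average weight (g)

# Draw 1000 pretend samples, compute a 95% CI for each,
# and count how often the interval captures the true mean
hits <- replicate(1000, {
  s  <- rnorm(50, mean = true_mean, sd = 300)  # one sample of 50 penguins
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(hits)  # close to 0.95
```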

34 / 45

'Ze hard way

I can generate these confidence intervals in a lot of ways. I can manually calculate them:

confidence_interval <- penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(
    sample_mean = mean(body_mass_g),
    sample_sd = sd(body_mass_g),
    n = n(),
    se = sample_sd / sqrt(n),
    ci_lower = sample_mean - qt(0.975, n - 1) * se,
    ci_upper = sample_mean + qt(0.975, n - 1) * se
  )

confidence_interval
## # A tibble: 6 × 8
## # Groups: species [3]
## species sex sample_mean sample_sd n se ci_lower ci_upper
## <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Adelie female 3369. 269. 73 31.5 3306. 3432.
## 2 Adelie male 4043. 347. 73 40.6 3963. 4124.
## 3 Chinstrap female 3527. 285. 34 48.9 3428. 3627.
## 4 Chinstrap male 3939. 362. 34 62.1 3813. 4065.
## 5 Gentoo female 4680. 282. 58 37.0 4606. 4754.
## 6 Gentoo male 5485. 313. 61 40.1 5405. 5565.

You don't need to understand this code. But fwiw, it's just the CI formula written out with the summarise() function, which we now know. Anyway, this is not ideal.
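If you want to convince yourself that this formula matches what R computes, you can check it against t.test() on a small made-up vector:

```r
x <- c(3000, 3200, 3400, 3100, 3600)  # made-up weights in grams

n  <- length(x)
se <- sd(x) / sqrt(n)
by_hand <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * se

by_hand                          # the formula from above
as.numeric(t.test(x)$conf.int)   # the same two numbers from t.test()
```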

35 / 45

Intervals (t's version)

An easier way is with something called the t.test() function. What a t-test does is beyond the scope of this session, but given a data vector, of say, these values:

df <- c(3,4,6,8,5,5,3,8,4,9,1)
t.test(df)

It returns the mean of that vector, and a set of 95% confidence interval bounds, which tells us the range between which the true mean is likely to fall for this data.

One Sample t-test
data: df
t = 6.8415, df = 10, p-value = 4.505e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
3.432900 6.748918
sample estimates:
mean of x
5.090909

This output can actually be turned into a neat little dataset thanks to the tidy() function from the broom package.

df <- c(3,4,6,8,5,5,3,8,4,9,1)
broom::tidy(t.test(df)) %>% select(estimate, conf.low, conf.high)
## # A tibble: 1 × 3
## estimate conf.low conf.high
## <dbl> <dbl> <dbl>
## 1 5.09 3.43 6.75
36 / 45

Allow me to do it for the data on body_mass_g that we are interested in and also plot it in a tie-fighter plot*.

*Only a Sith deals in absolutes. The rest of us use confidence intervals.

37 / 45

Learning by tinkering

This is a section of the code behind the previous slide; it is also in your exercise notebook. It uses a variety of data wrangling techniques to make it work within a tidyverse-style pipeline.

penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  nest() %>%
  mutate(model = map(data, ~ t.test(.$body_mass_g)),
         tidy_model = map(model, broom::tidy)) %>%
  unnest(tidy_model) %>%
  mutate(sex = fct_reorder(sex, estimate)) %>%
  select(species, sex, estimate, conf.low, conf.high) %>%
  head(4)
## # A tibble: 4 × 5
## # Groups: species, sex [4]
## species sex estimate conf.low conf.high
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Adelie male 4043. 3963. 4124.
## 2 Adelie female 3369. 3306. 3432.
## 3 Gentoo female 4680. 4606. 4754.
## 4 Gentoo male 5485. 5405. 5565.

Like I have said before, it is not necessary for you to completely understand what is happening here. The way to approach this is to break it down line by line -- execute only part of the pipeline and see how the data is transformed at each step.

38 / 45

...continued

If you run the pipeline only up to the nest() command, you'll see how the data gets collapsed into one row per group:

(screenshot: the nested data)

Or, you could change the variable going into the t.test() function from body_mass_g to something else -- what happens? Group by another variable; use this code as a launchpad to do your own thing.

I also endorse ChatGPT as a learning device. Ask it to explain and break things down; use it in a way that helps you learn, rather than treating this as a chore.

39 / 45

Wow that was intense

Maybe this will take time to settle in. But you made it!

Here is a recap in a few lines:

  1. We counted the number of penguins and understood how many of each there are.

  2. We became interested in their chonkiness, and calculated their average weights.

  3. We then observed this average in relation to the distribution using histograms.

  4. We reasoned that the mean does not describe the shape and spread of the data by itself and calculated the ranges.

  5. Looking at the ranges, we realized that this only shows us the extremes of the data, and still did not describe how these values were distributed around the mean.

  6. We used standard deviation to observe the spread of data, and worked backwards from the mean and this new value to create a distribution of the data.

  7. We then calculated the confidence intervals which allowed us to estimate the range in which the true average weight of the penguin population likely falls, based on our sample data.

40 / 45

So who da chonkiest?

Spoilers ahead

(chart: it's the Gentoos)

41 / 45

🐧 Your turn!

  • Ex. 2.7: Your exercise notebook has all that you need to create a similar chart, and a couple of others. Can you do a similar analysis of bill_length_mm or bill_depth_mm?

Hint: It is legal to copy-paste code and change variables!

  • Ex. 2.8: Find the mean flipper_length_mm for penguins with body_mass_g above and below the median.

Hint: Use filter() to create two subsets based on body_mass_g relative to its median, then calculate the means.

  • Ex. 2.9: Create a histogram of flipper_length_mm for each species to visualize the distribution.

Hint: Reuse my histogram code if you need to!

  • Ex. 3.0: Visit the R Graph Gallery. Find a visualization you like, copy-paste the code and replace variable names to arrive at a similar visualization. Go crazy.
42 / 45

Further resources

There's nothing better than practice. Here are some things I continuously come back to, which are non-imposing and easy to follow and get into:

  1. The TidyTuesday project, where basically everyone participating works on the same weekly dataset. And it's fun! On Twitter you can find this week's entries if you search for the hashtag; everyone shares their code as well.

  2. Watch one of the best analyze something live! David Robinson (G.O.A.T) has these amazing screencasts (annotated here). So many topics, so many datasets. The best mentor you can ask for.

  3. There's also the R4DS Slack, which is helpful if you want to ask questions. They also run book clubs, where everyone works through a book together.

  4. For books, ModernDive and the R4DS book are great.

  5. If you're anything like me, you probably like to do something first and then understand how it works. You can see a small glimpse of what is possible at the r-graph-gallery, tinker around, and then make stuff with your own data.

I literally paraphrased a message to Prakriti for this slide lmao

43 / 45

Okay bye

So long. Thanks for joining.

(image: Pingu)

44 / 45

References

Crump, M., D. Navarro, and J. Suzuki (2022). Answering Questions with Data: Introductory Statistics for Psychology Students. DOI: 10.17605/OSF.IO/JZE52. (Visited on Dec. 10, 2023).

Wickham, H. and G. Grolemund (2023). R for Data Science. O'Reilly Media. https://www.oreilly.com/library/view/r-for-data/9781491910382/. (Visited on Dec. 07, 2023).

Ismay, C. and A. Y. Kim (2023). Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. Routledge. https://www.routledge.com/Statistical-Inference-via-Data-Science-A-ModernDive-into-R-and-the-Tidyverse/Ismay-Kim/p/book/9780367409821. (Visited on Dec. 10, 2023).

Baldi, B. (2023). Practice of Statistics in the Life Sciences, 4th Edition. Macmillan Learning. https://store.macmillanlearning.com/ca/product/Practice-of-Statistics-in-the-Life-Sciences-4th-edition/p/1319013376. (Visited on Dec. 10, 2023).

Robinson, D. (2023). Tidy Tuesday Screencast: Analyzing NYC Restaurant Inspections with R. YouTube. https://www.youtube.com/watch?v=em4FXPf4H-Y&t=1620s. (Visited on Dec. 10, 2023).

45 / 45
