This handout demonstrates how to conduct some of the analyses we learned in EDPY 577/677 using R code instead of the SPSS point-and-click method.
It is recommended that you use an R Project for each course or major assignment. This will avoid having to set a working directory and write long file paths. You can move your files anywhere. As long as the R Project .rproj file is included, there should be no issues.
To create an R Project, simply go to File > New Project…. If you want to create a new directory, select New Directory. On the next screen, choose New Project, locate where you want to put your project, and click Create Project. If you have a directory already and you just want to make it into a project, click Existing Directory. The Version Control option is useful if you will be placing your project on Github.
To access a folder within your R Project folder, you can use a simple path such as read.csv("data/data.csv")
. To access a file outside of your R Project folder, you can use ..
: read.csv("../other_data/data.csv")
.
Packages are installed by typing install.packages("package name")
in the R
console. To replicate the examples in this document, you will need to install the following packages (Note: you only need to install packages one time):
install.packages("palmerpenguins") # for example data
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
Loading a package can be done with the following code: library(package name)
. It is recommended to load all packages at the top of your script file.
You can load many different data types using R
:
CSV files with read.csv
from the base R
package:
read.csv("name_of_file.csv")
read.csv("http://www.website.com/internet/file.csv)
Excel files with read_excel
from the readxl
package (you may need to modify function defaults for read_excel to work properly). :
read_excel("name_of_file.xlsx")
Personally, I like to convert Excel files to a CSV file and load the CSV file into R
The haven
package reads data from other statistical software packages:
read_spss("name_of_file.sav")
or read_sav
reads SPSS data filesread_dta
reads Stata data filesread_sas
reads SAS data fileshaven
can also save a data set as an SPSS, Stata, or SAS data file using the write_sav
, write_dta
, and write_sas
functionsR is an object-oriented program language. That means that results are stored as objects that you can retrieve and manipulate later. As such, you save data and results as R objects. You can use =
or <-
. <-
is recommended. Examples: Note: c() means “combine”
# character vector
phrase <- "Good luck in Stats III"
#numeric vector
answers <- c(6*5, 6^2, 5*sqrt(23))
# a data frame
dataframe <- data.frame(col_1 = c(1,2,3),
col_2 = c("A", "B", "C"))
# assign data to an object named penguins
penguins <- read.csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
# assign results from a t-test to t_test_results
t_test_results <- t.test(body_mass_g ~ sex, data=penguins)
For more practice, check out this interactive tutorial on R programming basics.
%>%
In some of the code below you will see the pipe operator, %>%
. This is a function from the tidyverse
package that allows you to connect multiple functions together without the use of multiple parentheses. It stands for “and then”. You can easily connected multiple commands in a single pipe chain. This is especially useful when data cleaning. Here is an example of multiple functions in a chain. The following code:
NA
values in “island”Note: R 4.1 introduced a native pipe operator |>
that works similarly to the tidyverse
pipe operator %>%
. I have not tested or updated the syntax throughout the course with the native pipe operator yet, but it should work in most instances similar to the tidyverse pipe.
penguins %>%
select(bill_length_mm, flipper_length_mm, island) %>%
drop_na(island) %>%
mutate(combined = bill_length_mm * flipper_length_mm,
id = row_number()) %>%
filter(bill_length_mm > 25) %>%
summarize(combined_mean = mean(combined, na.rm=T))
## combined_mean
## 1 8874.812
Without piping (i.e., Base R), you would need to write this code as the following, which uses multiple data objects: Note: $
is used to subset or name an element of a list
penguins_subset <- subset(penguins, !is.na(island), select=c(bill_length_mm, flipper_length_mm, island))
penguins_subset$combined <- penguins_subset$bill_length_mm* penguins_subset$flipper_length_mm
penguins_subset$id <- seq.int(nrow(penguins_subset))
penguins_subset <- subset(penguins_subset, bill_length_mm > 25)
mean(penguins_subset$combined, na.rm=T)
## [1] 8874.812
In summary, there are many ways to write R
code. Sometimes %>%
pipes make it easier. You may use whichever you prefer.
Quickly view your column names (variables) with the names
function:
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
View the types of variables using str
:
str(penguins)
## 'data.frame': 344 obs. of 8 variables:
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : chr "male" "female" "female" NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Variable types:
summary
from base R
For numeric variables, this will return quartiles, median, means, mix, max, and NAs. For nominal/categorical variables, this will return counts (n’s).
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Length:344 Length:344 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60
## Mode :character Mode :character Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 Length:344 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
## Median :197.0 Median :4050 Mode :character Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Use the $
dollar sign to refer to a specific variable.
summary(penguins$bill_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.10 39.23 44.45 43.92 48.50 59.60 2
describe
from the psych
packageYou can load the psych
package and use describe to get mean, sd, median, and other variables. Note: if you haven’t installed the psych
package run install.packages("psych")
before running code below.
library(psych)
psych::describe(penguins)
## vars n mean sd median trimmed mad min max
## species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
## island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
## bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
## bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
## flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
## body_mass_g 6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
## sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
## year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
## range skew kurtosis se
## species* 2.0 0.16 -1.73 0.05
## island* 2.0 0.61 -0.91 0.04
## bill_length_mm 27.5 0.05 -0.89 0.30
## bill_depth_mm 8.4 -0.14 -0.92 0.11
## flipper_length_mm 59.0 0.34 -1.00 0.76
## body_mass_g 3600.0 0.47 -0.74 43.36
## sex* 1.0 -0.02 -2.01 0.03
## year 2.0 -0.05 -1.51 0.04
describe(penguins$flipper_length_mm)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 342 200.92 14.06 197 200.34 16.31 172 231 59 0.34 -1 0.76
We can use describeBy(numeric_variable, group=grouping_variable)
from the psych
package if you want the descriptive statistics for individual groups:
describeBy(penguins$flipper_length_mm, group=penguins$species)
##
## Descriptive statistics by group
## group: Adelie
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 151 189.95 6.54 190 189.93 7.41 172 210 38 0.09 0.24 0.53
## ------------------------------------------------------------
## group: Chinstrap
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 68 195.82 7.13 196 195.75 7.41 178 212 34 -0.01 -0.13 0.86
## ------------------------------------------------------------
## group: Gentoo
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 123 217.19 6.48 216 216.83 5.93 203 231 28 0.39 -0.64 0.58
skim
from the skimr
packageskim
will give you number missing, mean, sd, percentile, and a small histogram. skim
can be used on all variables or a specific variable, just as demonstrated above.
library(skimr)
#Tidyverse
penguins %>%
skim(body_mass_g)
Name | Piped data |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700 | 3550 | 4050 | 4750 | 6300 | ▃▇▆▃▂ |
#Base R
skim(penguins$body_mass_g)
Name | penguins$body_mass_g |
Number of rows | 344 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 2 | 0.99 | 4201.75 | 801.95 | 2700 | 3550 | 4050 | 4750 | 6300 | ▃▇▆▃▂ |
We can also use this function for groups if we first use group_by
:
#Tidyverse
penguins %>%
group_by(species) %>%
skim(body_mass_g)
Name | Piped data |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | species |
Variable type: numeric
skim_variable | species | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
body_mass_g | Adelie | 1 | 0.99 | 3700.66 | 458.57 | 2850 | 3350.0 | 3700 | 4000 | 4775 | ▅▇▇▃▂ |
body_mass_g | Chinstrap | 0 | 1.00 | 3733.09 | 384.34 | 2700 | 3487.5 | 3700 | 3950 | 4800 | ▁▅▇▃▁ |
body_mass_g | Gentoo | 1 | 0.99 | 5076.02 | 504.12 | 3950 | 4700.0 | 5000 | 5500 | 6300 | ▃▇▇▇▂ |
summarize
from the tidyverse
packageYou can make your own custom summary using summarize
combined with base R
functions of mean
, median
, min
, max
, sd
, and so on. Add na.rm=T
to remove NA values.
#Tidyverse
penguins %>%
summarize(body_mass_mean = mean(body_mass_g, na.rm=T),
body_mass_sd = sd(body_mass_g, na.rm = T),
bill_length_mean = mean(bill_length_mm, na.rm=T),
bill_length_sd = sd(bill_length_mm, na.rm=T))
## body_mass_mean body_mass_sd bill_length_mean bill_length_sd
## 1 4201.754 801.9545 43.92193 5.459584
#Base R
cbind(mean(penguins$body_mass_g, na.rm=T), sd(penguins$body_mass_g, na.rm=T), mean(penguins$bill_length_mm, na.rm=T), sd(penguins$bill_length_mm, na.rm=T))
## [,1] [,2] [,3] [,4]
## [1,] 4201.754 801.9545 43.92193 5.459584
We can also use group_by
to summarize by group.
#Tidyverse
penguins %>%
group_by(species) %>% #add one or more groups
summarize(body_mass_mean = mean(body_mass_g, na.rm=T),
body_mass_sd = sd(body_mass_g, na.rm = T),
bill_length_mean = mean(bill_length_mm, na.rm=T),
bill_length_sd = sd(bill_length_mm, na.rm=T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 5
## species body_mass_mean body_mass_sd bill_length_mean bill_length_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 3701. 459. 38.8 2.66
## 2 Chinstrap 3733. 384. 48.8 3.34
## 3 Gentoo 5076. 504. 47.5 3.08
#Base R
penguins_noNA <- na.omit(penguins) #remove NA
by(penguins_noNA$body_mass_g, penguins_noNA$species, mean)
## penguins_noNA$species: Adelie
## [1] 3706.164
## ------------------------------------------------------------
## penguins_noNA$species: Chinstrap
## [1] 3733.088
## ------------------------------------------------------------
## penguins_noNA$species: Gentoo
## [1] 5092.437
by(penguins_noNA$body_mass_g, penguins_noNA$species, sd)
## penguins_noNA$species: Adelie
## [1] 458.6201
## ------------------------------------------------------------
## penguins_noNA$species: Chinstrap
## [1] 384.3351
## ------------------------------------------------------------
## penguins_noNA$species: Gentoo
## [1] 501.4762
by(penguins_noNA$bill_length_mm, penguins_noNA$species, mean)
## penguins_noNA$species: Adelie
## [1] 38.82397
## ------------------------------------------------------------
## penguins_noNA$species: Chinstrap
## [1] 48.83382
## ------------------------------------------------------------
## penguins_noNA$species: Gentoo
## [1] 47.56807
by(penguins_noNA$bill_length_mm, penguins_noNA$species, sd)
## penguins_noNA$species: Adelie
## [1] 2.662597
## ------------------------------------------------------------
## penguins_noNA$species: Chinstrap
## [1] 3.339256
## ------------------------------------------------------------
## penguins_noNA$species: Gentoo
## [1] 3.106116
tabyl
from the janitor
package (https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)tabyl
can get frequencies for single variables. The table
function in Base R will so create a frequency table (but it isn’t as nice).
library(janitor)
tabyl(penguins$species)
## penguins$species n percent
## Adelie 152 0.4418605
## Chinstrap 68 0.1976744
## Gentoo 124 0.3604651
table(penguins$species)
##
## Adelie Chinstrap Gentoo
## 152 68 124
# to get proportions you need to wrap table in prob.table
prop.table(table(penguins$species))
##
## Adelie Chinstrap Gentoo
## 0.4418605 0.1976744 0.3604651
tabyl
can also get crosstabs or frequencies by groups:
#Tidyverse
penguins %>%
tabyl(species, island)
## species Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
# Base R
table(penguins$species, penguins$island)
##
## Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
You can easily add row and column totals like so:
#Tidyverse
penguins %>%
tabyl(species, island) %>%
adorn_totals(c("row", "col"))
## species Biscoe Dream Torgersen Total
## Adelie 44 56 52 152
## Chinstrap 0 68 0 68
## Gentoo 124 0 0 124
## Total 168 124 52 344
#Base R
table <- tabyl(penguins$species, penguins$island) #notice the error
## Error in tabyl.default(penguins$species, penguins$island): The value supplied to the "show_na" argument must be TRUE or FALSE.
##
## Did you try to call tabyl on two vectors, like tabyl(data$var1, data$var2) ? To create a two-way tabyl, the two vectors must be in the same data.frame, and the function should be called like this:
##
## tabyl(data, var1, var2)
## or
## data %>% tabyl(var1, var2).
##
## See ?tabyl for more.
table <- tabyl(penguins, species, island)
adorn_totals(table, c("row", "col"))
## species Biscoe Dream Torgersen Total
## Adelie 44 56 52 152
## Chinstrap 0 68 0 68
## Gentoo 124 0 0 124
## Total 168 124 52 344
Use adorn_
commands to use percentages, rounding, etc:
#Tidyverse
penguins %>%
tabyl(species, island) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages() %>%
adorn_rounding()
## species Biscoe Dream Torgersen Total
## Adelie 0.3 0.4 0.3 1
## Chinstrap 0.0 1.0 0.0 1
## Gentoo 1.0 0.0 0.0 1
## Total 0.5 0.4 0.2 1
#Base R
table <- tabyl(penguins, species, island)
table <- adorn_totals(table, c("row", "col"))
table <- adorn_percentages(table)
adorn_rounding(table)
## species Biscoe Dream Torgersen Total
## Adelie 0.3 0.4 0.3 1
## Chinstrap 0.0 1.0 0.0 1
## Gentoo 1.0 0.0 0.0 1
## Total 0.5 0.4 0.2 1
You can easily see frequencies by making a histogram with the hist
command:
hist(penguins$bill_depth_mm)
Read more, about descriptives in R Reproducible statistics for psychologists with R: Descriptives.
The following functions will help you change and manipulate your data.
Use the rename("new name" = old_name)
function:
names(penguins) #print variable names
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
#Tidyverse
penguins <- penguins %>%
rename("weight" = body_mass_g)
names(penguins) #check that rename worked
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "weight"
## [7] "sex" "year"
#Base R
names(penguins)[names(penguins)=="body_mass_g"] <- "weight" #by variable name
names(penguins)[6] <- "weight" #by column number
names(penguins) #check that rename worked
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "weight"
## [7] "sex" "year"
You can use recode
inside a mutate
(change) function to change how individual factor levels are assigned. For example, for penguin sex, there is female, male, and NA. Let’s change female to 0 and male to 1 in the variable sex 2.
#Tidyverse
penguins %>%
count(sex)
## sex n
## 1 female 165
## 2 male 168
## 3 <NA> 11
penguins %>%
mutate(male = recode(sex,
male = 1,
female = 0)) %>%
select(sex, male) %>% # select the two sex variables to compare
head() #head lists the first 6 parts
## sex male
## 1 male 1
## 2 female 0
## 3 female 0
## 4 <NA> NA
## 5 female 0
## 6 male 1
#Base R
table(penguins$sex)
##
## female male
## 165 168
penguins$male <- ifelse(penguins$sex=="male", 1, ifelse(penguins$sex=="female",0, NA))
table(penguins$male)
##
## 0 1
## 165 168
We can use mutate
to change and create variables. For example, in the penguins
data set, weight is in grams. Let’s convert that to pounds. 1 gram = 0.00220462 pounds.
#Tidyverse
penguins <- penguins %>%
mutate(weight_lbs = weight * 0.00220462)
#Base R
penguins$weight_lbs <- penguins$weight*0.00220462
For more practice, check out some common data manipulation tools in these interactive exercises.
Use cor
from base R
for most correlations (e.g., Pearson’s r, Kendall’s \(\tau\), Spearman’s \(\rho\)). Arguments include “method” and “use”. Method has options for Pearson, Kendall, and Spearman correlations. If there is missing data, like below, you can use pairwise complete observations or only complete observations. Type ?cor
in the console for more information.
penguins %>%
select(where(is.numeric)) %>% #selecting only numeric variables
cor(method = "pearson",
use = "complete.obs") %>%
round(3)
## bill_length_mm bill_depth_mm flipper_length_mm weight year
## bill_length_mm 1.000 -0.229 0.653 0.589 0.033
## bill_depth_mm -0.229 1.000 -0.578 -0.472 -0.048
## flipper_length_mm 0.653 -0.578 1.000 0.873 0.151
## weight 0.589 -0.472 0.873 1.000 0.022
## year 0.033 -0.048 0.151 0.022 1.000
## male 0.344 0.373 0.255 0.425 0.000
## weight_lbs 0.589 -0.472 0.873 1.000 0.022
## male weight_lbs
## bill_length_mm 0.344 0.589
## bill_depth_mm 0.373 -0.472
## flipper_length_mm 0.255 0.873
## weight 0.425 1.000
## year 0.000 0.022
## male 1.000 0.425
## weight_lbs 0.425 1.000
#Base R
cor(penguins, use="complete.obs") #notice the error 'x' must be numeric. This indicates that we have non-numeric variables (i.e., strings, factors)
## Error in cor(penguins, use = "complete.obs"): 'x' must be numeric
str(penguins) #anything that is not "num" or "int" can not be used in a correlation
## 'data.frame': 344 obs. of 10 variables:
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ weight : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : chr "male" "female" "female" NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ male : num 1 0 0 NA 0 1 0 1 NA NA ...
## $ weight_lbs : num 8.27 8.38 7.17 NA 7.61 ...
#Filter using only numeric
round(cor(Filter(is.numeric, penguins), use="complete.obs"),3)
## bill_length_mm bill_depth_mm flipper_length_mm weight year
## bill_length_mm 1.000 -0.229 0.653 0.589 0.033
## bill_depth_mm -0.229 1.000 -0.578 -0.472 -0.048
## flipper_length_mm 0.653 -0.578 1.000 0.873 0.151
## weight 0.589 -0.472 0.873 1.000 0.022
## year 0.033 -0.048 0.151 0.022 1.000
## male 0.344 0.373 0.255 0.425 0.000
## weight_lbs 0.589 -0.472 0.873 1.000 0.022
## male weight_lbs
## bill_length_mm 0.344 0.589
## bill_depth_mm 0.373 -0.472
## flipper_length_mm 0.255 0.873
## weight 0.425 1.000
## year 0.000 0.022
## male 1.000 0.425
## weight_lbs 0.425 1.000
If you just need the correlation coefficient between two variables, you can correlate only those variables using cor(data$var1, data$var2, method="method", use="use")
. Note the use of round()
to round results to 3 decimal places.
#Tidyverse
cor(penguins$bill_length_mm, penguins$flipper_length_mm,
method = "pearson",
use = "complete.obs") %>%
round(3)
## [1] 0.656
#Base R
round(cor(penguins$bill_length_mm, penguins$flipper_length_mm, use = "complete.obs"),3)
## [1] 0.656
You can use cor.test
to get the correlation coefficient, a p-value, and a confidence interval for the correlation:
cor.test(penguins$bill_length_mm, penguins$flipper_length_mm,
method = "pearson",
use = "complete.obs")
##
## Pearson's product-moment correlation
##
## data: penguins$bill_length_mm and penguins$flipper_length_mm
## t = 16.034, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5912769 0.7126403
## sample estimates:
## cor
## 0.6561813
The results of a Pearson r correlation with variable 1 and variable 2 showed that these two variables were significantly positively related, \(r=.\#\#,p = .\#\#, r^2 = .\#\#\).
The results of a Pearson r correlation between penguin bill length and flipper length showed that these two variables were significantly positively related, \(r = .66\), \(p < .001\), \(r^2 = .42.\)
Read more about correlation in R:
Recall that crosstabs can easily be displayed with tabyl
as well as in Base R using table
:
#Tidyverse and tabyl
penguins %>%
tabyl(species, island)
## species Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
#Base R
table(penguins$species, penguins$island)
##
## Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
To calculate a chi-square statistics, we can call janitor
’s chisq.test
function. Note that because base R
also has a chisq.test
function, we will use janitor::chisq.test()
. Both packages function the same way, only the data is entered differently. (Base R
’s chisq.test
cannot be used with %>%
pipes and so the data must be entered using $
).
For either function, you must give it a crosstabs table and use chisq.test(crosstabs table)
.
# Tidyverse, tably, janitor
penguins %>%
tabyl(species, island) %>%
janitor::chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 299.55, df = 4, p-value < 2.2e-16
#Base R
stats::chisq.test(penguins$species, penguins$island)
##
## Pearson's Chi-squared test
##
## data: penguins$species and penguins$island
## X-squared = 299.55, df = 4, p-value < 2.2e-16
Read more about chi square in R:
We can use the t.test
function from base R
to conduct an independent t-test with one grouping variable (IV) that has 2 levels and one dependent variable (DV) that is continuous.
The t.test
function has several arguments. For an independent-samples t-test, you can use: t.test(DV ~ IV, data=data, var.equal=FALSE)
To assess whether variances are similar across groups, we can use the car
package and the leveneTest(DV ~ IV, data=data)
for this:
library(car)
leveneTest(weight_lbs ~ sex, # DV ~ IV
data=penguins)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 6.0586 0.01435 *
## 331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result showed a significant difference in variances (i.e., they are not equal).
t.test(weight_lbs ~ sex, # DV ~ IV
data=penguins,
var.equal = FALSE) # Note we choose var.equal = FALSE because of the Levene's test above. FALSE is default so if we did not include this, it would default to a Welch's t-test
##
## Welch Two Sample t-test
##
## data: weight_lbs by sex
## t = -8.5545, df = 323.9, p-value = 4.794e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.853156 -1.160171
## sample estimates:
## mean in group female mean in group male
## 8.514844 10.021507
Optional: We can get a neater table of data by using tidy
from the broom
package:
#Optional
library(broom)
t.test(weight_lbs ~ sex, # DV ~ IV
data=penguins,
paired=F,
var.equal = F) %>%
tidy()
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.51 8.51 10.0 -8.55 4.79e-16 324. -1.85 -1.16
## # ... with 2 more variables: method <chr>, alternative <chr>
# using psych package
psych::cohen.d(penguins[, c("weight_lbs", "sex")], "sex")
## Call: psych::cohen.d(x = penguins[, c("weight_lbs", "sex")], group = "sex")
## Cohen d statistic of difference between two means
## lower effect upper
## weight_lbs 0.7 0.94 1.18
##
## Multivariate (Mahalanobis) distance between groups
## [1] 0.94
## r equivalent of difference between two means
## weight_lbs
## 0.42
Read more about independent t-tests in R:
For more information on performing t-tests in R see: https://bookdown.org/ndphillips/YaRrr/t-test-t-test.html
You can use the t-test command above, adding paired=T
to the t.test
command. The penguins
data set does not have any paired data, so we will simulate some by adding random variation to each value:
penguins <- penguins %>%
mutate(weight_lbs2 = weight_lbs + sample(-10:10, n(), replace = TRUE))
(paired <- t.test(penguins$weight_lbs, penguins$weight_lbs2, paired=T))
##
## Paired t-test
##
## data: penguins$weight_lbs and penguins$weight_lbs2
## t = -1.164, df = 341, p-value = 0.2453
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.006741 0.258203
## sample estimates:
## mean of the differences
## -0.374269
# paired-samples Cohen's d
difference <- penguins$weight_lbs - penguins$weight_lbs2
(cohens_d <- mean(difference, na.rm=T)/sd(difference, na.rm=T))
## [1] -0.06293927
Read more about dependent t-tests in R:
You can use aov
from the stats
package for a oneway between-subjects ANOVA, like this: aov(DV ~ group, data = data)
. We can then use summary.aov
to get the results. Note: the aov
function computes Type I Sums of Squares. If you want to repoduce the output you get from SPSS, you’ll need to use the car
package and the Anova
function to compute Type III Sums of Squares. For balanced designs, these give the same results. See the following for more information on ANOVA Sums of Squares calculation: https://www.r-bloggers.com/2011/03/anova-%E2%80%93-type-iiiiii-ss-explained/.
For more information on performing ANOVA in R see: https://bookdown.org/ndphillips/YaRrr/anova.html
weight_analysis <- aov(weight_lbs ~ island, data = penguins)
summary.aov(weight_analysis)
## Df Sum Sq Mean Sq F value Pr(>F)
## island 2 419.5 209.76 110 <2e-16 ***
## Residuals 339 646.4 1.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
# Type III Sums of Squares using the `Anova` function from the `car` package
(weight_analysis_2 <- car::Anova(lm(weight_lbs ~ island, data=penguins), type="III"))
## Anova Table (Type III tests)
##
## Response: weight_lbs
## Sum Sq Df F value Pr(>F)
## (Intercept) 18052.4 1 9467.55 < 2.2e-16 ***
## island 419.5 2 110.01 < 2.2e-16 ***
## Residuals 646.4 339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Welch's ANOVA
oneway.test(weight_lbs ~ island, data = penguins)
##
## One-way analysis of means (not assuming equal variances)
##
## data: weight_lbs and island
## F = 106.97, num df = 2.00, denom df = 150.41, p-value < 2.2e-16
# Levene's test
car::leveneTest(weight_lbs ~ island, data = penguins)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 25.064 7.053e-11 ***
## 339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can perform post-hoc tests using different functions. For example, a Tukey test:
TukeyHSD(weight_analysis)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight_lbs ~ island, data = penguins)
##
## $island
## diff lwr upr p adj
## Dream-Biscoe -2.21148681 -2.5968254 -1.826148 0.0000000
## Torgersen-Biscoe -2.22588447 -2.7459404 -1.705829 0.0000000
## Torgersen-Dream -0.01439766 -0.5551373 0.526342 0.9978364
A pairwise t-test:
pairwise.t.test(penguins$weight_lbs, penguins$island,
p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: penguins$weight_lbs and penguins$island
##
## Biscoe Dream
## Dream <2e-16 -
## Torgersen <2e-16 0.95
##
## P value adjustment method: BH
To calculate an effect size, such as eta squared or partial eta squared, you can use the sjstats
package. (Note: in an ANOVA model with one IV, eta squared = partial eta squared.)
library(sjstats)
eta_sq(weight_analysis, partial = T)
## term partial.etasq
## 1 island 0.394
# Using effectsize package
effectsize::omega_squared(weight_analysis)
## # Effect Size for ANOVA
##
## Parameter | Omega2 | 95% CI
## ---------------------------------
## island | 0.39 | [0.32, 1.00]
##
## - One-sided CIs: upper bound fixed at (1).
If you want to save the data and objects that you have loaded, cleaned, and analyzed, you may do so in two ways:
You can select the data objects and save them using save
and saving it as a .Rdata file:
save(penguins, t_test_results, weight_analysis, file="my_data.Rdata")
You may load this data back into your R
environment using load
:
load("my_data.Rdata")
You can also save all the entire environment (workspace) using save.image
:
save.image(file = "all_data.Rdata")
You can load it the same way as above.
You can also save data frames as CSV files with write.csv
:
write.csv(penguins, file="penguins_data.csv")
You can save your code and output together by compiling your document. Navigate to File -> Compile Document and choose HTML, Word, or PDF. Note: if there are any errors in your code, it will not compile until you fix them. Reading the error messages can help.