This handout demonstrates how to conduct some of the analyses we learned in EDPY 577/677 using R code instead of the SPSS point-and-click method.

R Set Up

It is recommended that you use an R Project for each course or major assignment. This will avoid having to set a working directory and write long file paths. You can move your files anywhere. As long as the R Project .rproj file is included, there should be no issues.

To create an R Project, simply go to File > New Project…. If you want to create a new directory, select New Directory. On the next screen, choose New Project, locate where you want to put your project, and click Create Project. If you have a directory already and you just want to make it into a project, click Existing Directory. The Version Control option is useful if you will be placing your project on Github.

To access a folder within your R Project folder, you can use a simple path such as read.csv("data/data.csv"). To access a file outside of your R Project folder, you can use ..: read.csv("../other_data/data.csv").

Packages

Installing Packages

Packages are installed by typing install.packages("package name") in the R console. To replicate the examples in this document, you will need to install the following packages (Note: you only need to install packages one time):

install.packages("palmerpenguins") # for example data
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")

Loading Packages

Loading a package can be done with the following code: library(package name). It is recommended to load all packages at the top of your script file.

Loading Data

You can load many different data types using R:

CSV files with read.csv from the base R package:
- read.csv("name_of_file.csv")
- read.csv("http://www.website.com/internet/file.csv)
Excel files with read_excel from the readxl package (you may need to modify function defaults for read_excel to work properly). :
- read_excel("name_of_file.xlsx")
- Personally, I like to convert Excel files to a CSV file and load the CSV file into R
The haven package reads data from other statistical software packages:
- read_spss("name_of_file.sav") or read_sav reads SPSS data files
- read_dta reads Stata data files
- read_sas reads SAS data files
- haven can also save a data set as an SPSS, Stata, or SAS data file using the write_sav, write_dta, and write_sas functions

Data Objects

R is an object-oriented program language. That means that results are stored as objects that you can retrieve and manipulate later. As such, you save data and results as R objects. You can use = or <-. <- is recommended. Examples: Note: c() means “combine”

# character vector
phrase <- "Good luck in Stats III" 

#numeric vector
answers <- c(6*5, 6^2, 5*sqrt(23))

# a data frame
dataframe <- data.frame(col_1 = c(1,2,3),
                        col_2 = c("A", "B", "C"))

# assign data to an object named penguins
penguins <- read.csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# assign results from a t-test to t_test_results
t_test_results <- t.test(body_mass_g ~ sex, data=penguins)

For more practice, check out this interactive tutorial on R programming basics.

The Pipe `%>%`

In some of the code below you will see the pipe operator, %>%. This is a function from the tidyverse package that allows you to connect multiple functions together without the use of multiple parentheses. It stands for “and then”. You can easily connected multiple commands in a single pipe chain. This is especially useful when data cleaning. Here is an example of multiple functions in a chain. The following code:

Accesses the penguins data
Selects 3 columns
Drops any rows with NA values in “island”
Creates a new variable named “combined” and a row ID variable named “id”
Filters the data to only include penguins with bill length greater than 25 mm
Summarizes the data by creating a combined mean value

Note: R 4.1 introduced a native pipe operator |> that works similarly to the tidyverse pipe operator %>%. I have not tested or updated the syntax throughout the course with the native pipe operator yet, but it should work in most instances similar to the tidyverse pipe.

penguins %>%
  select(bill_length_mm, flipper_length_mm, island) %>%
  drop_na(island) %>%
  mutate(combined = bill_length_mm * flipper_length_mm,
         id = row_number()) %>%
  filter(bill_length_mm > 25) %>%
  summarize(combined_mean = mean(combined, na.rm=T))

##   combined_mean
## 1      8874.812

Without piping (i.e., Base R), you would need to write this code as the following, which uses multiple data objects: Note: $ is used to subset or name an element of a list

penguins_subset <- subset(penguins, !is.na(island), select=c(bill_length_mm, flipper_length_mm, island))

penguins_subset$combined <- penguins_subset$bill_length_mm* penguins_subset$flipper_length_mm

penguins_subset$id <- seq.int(nrow(penguins_subset)) 

penguins_subset <- subset(penguins_subset, bill_length_mm > 25)

mean(penguins_subset$combined, na.rm=T)

## [1] 8874.812

In summary, there are many ways to write R code. Sometimes %>% pipes make it easier. You may use whichever you prefer.

Descriptive Statistics

Variable Names

Quickly view your column names (variables) with the names function:

names(penguins)

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Variable Types

View the types of variables using str:

str(penguins)

## 'data.frame':    344 obs. of  8 variables:
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : chr  "male" "female" "female" NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Variable types:

num - numeric
int - integer (1L, 2L; same as numeric but no decimals)
dbl - double (double precision; same as numeric)
factor - categorical
chr - character text
logical - TRUE/FALSE

Calculating Means

`summary` from base `R`

For numeric variables, this will return quartiles, median, means, mix, max, and NAs. For nominal/categorical variables, this will return counts (n’s).

Summarize all variables

summary(penguins)

##    species             island          bill_length_mm  bill_depth_mm  
##  Length:344         Length:344         Min.   :32.10   Min.   :13.10  
##  Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60  
##  Mode  :character   Mode  :character   Median :44.45   Median :17.30  
##                                        Mean   :43.92   Mean   :17.15  
##                                        3rd Qu.:48.50   3rd Qu.:18.70  
##                                        Max.   :59.60   Max.   :21.50  
##                                        NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex                 year     
##  Min.   :172.0     Min.   :2700   Length:344         Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   Class :character   1st Qu.:2007  
##  Median :197.0     Median :4050   Mode  :character   Median :2008  
##  Mean   :200.9     Mean   :4202                      Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                      3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                      Max.   :2009  
##  NA's   :2         NA's   :2

Summarize a single variable

Use the $ dollar sign to refer to a specific variable.

summary(penguins$bill_length_mm)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32.10   39.23   44.45   43.92   48.50   59.60       2

`describe` from the `psych` package

You can load the psych package and use describe to get mean, sd, median, and other variables. Note: if you haven’t installed the psych package run install.packages("psych") before running code below.

Describe all variables

library(psych)

psych::describe(penguins)

##                   vars   n    mean     sd  median trimmed    mad    min    max
## species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
## island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
## bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
## bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
## flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
## body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
## sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
## year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
##                    range  skew kurtosis    se
## species*             2.0  0.16    -1.73  0.05
## island*              2.0  0.61    -0.91  0.04
## bill_length_mm      27.5  0.05    -0.89  0.30
## bill_depth_mm        8.4 -0.14    -0.92  0.11
## flipper_length_mm   59.0  0.34    -1.00  0.76
## body_mass_g       3600.0  0.47    -0.74 43.36
## sex*                 1.0 -0.02    -2.01  0.03
## year                 2.0 -0.05    -1.51  0.04

Describe a single variables

describe(penguins$flipper_length_mm)

##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 342 200.92 14.06    197  200.34 16.31 172 231    59 0.34       -1 0.76

Describe by a group

We can use describeBy(numeric_variable, group=grouping_variable) from the psych package if you want the descriptive statistics for individual groups:

describeBy(penguins$flipper_length_mm, group=penguins$species)

## 
##  Descriptive statistics by group 
## group: Adelie
##    vars   n   mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 151 189.95 6.54    190  189.93 7.41 172 210    38 0.09     0.24 0.53
## ------------------------------------------------------------ 
## group: Chinstrap
##    vars  n   mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 68 195.82 7.13    196  195.75 7.41 178 212    34 -0.01    -0.13 0.86
## ------------------------------------------------------------ 
## group: Gentoo
##    vars   n   mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 123 217.19 6.48    216  216.83 5.93 203 231    28 0.39    -0.64 0.58

`skim` from the `skimr` package

skim will give you number missing, mean, sd, percentile, and a small histogram. skim can be used on all variables or a specific variable, just as demonstrated above.

library(skimr)

#Tidyverse
penguins %>%
  skim(body_mass_g)

Data summary
Name	Piped data
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
body_mass_g	2	0.99	4201.75	801.95	2700	3550	4050	4750	6300	▃▇▆▃▂

#Base R
skim(penguins$body_mass_g)

Data summary
Name	penguins$body_mass_g
Number of rows	344
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	2	0.99	4201.75	801.95	2700	3550	4050	4750	6300	▃▇▆▃▂

We can also use this function for groups if we first use group_by:

#Tidyverse
penguins %>%
  group_by(species) %>%
  skim(body_mass_g)

Data summary
Name	Piped data
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
numeric	1
________________________
Group variables	species

Variable type: numeric

skim_variable	species	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
body_mass_g	Adelie	1	0.99	3700.66	458.57	2850	3350.0	3700	4000	4775	▅▇▇▃▂
body_mass_g	Chinstrap	0	1.00	3733.09	384.34	2700	3487.5	3700	3950	4800	▁▅▇▃▁
body_mass_g	Gentoo	1	0.99	5076.02	504.12	3950	4700.0	5000	5500	6300	▃▇▇▇▂

`summarize` from the `tidyverse` package

You can make your own custom summary using summarize combined with base R functions of mean, median, min, max, sd, and so on. Add na.rm=T to remove NA values.

#Tidyverse
penguins %>%
  summarize(body_mass_mean = mean(body_mass_g, na.rm=T),
            body_mass_sd = sd(body_mass_g, na.rm = T),
            bill_length_mean = mean(bill_length_mm, na.rm=T),
            bill_length_sd = sd(bill_length_mm, na.rm=T))

##   body_mass_mean body_mass_sd bill_length_mean bill_length_sd
## 1       4201.754     801.9545         43.92193       5.459584

#Base R
cbind(mean(penguins$body_mass_g, na.rm=T), sd(penguins$body_mass_g, na.rm=T), mean(penguins$bill_length_mm, na.rm=T), sd(penguins$bill_length_mm, na.rm=T))

##          [,1]     [,2]     [,3]     [,4]
## [1,] 4201.754 801.9545 43.92193 5.459584

Summarize by a group

We can also use group_by to summarize by group.

#Tidyverse
penguins %>%
  group_by(species) %>% #add one or more groups
  summarize(body_mass_mean = mean(body_mass_g, na.rm=T),
            body_mass_sd = sd(body_mass_g, na.rm = T),
            bill_length_mean = mean(bill_length_mm, na.rm=T),
            bill_length_sd = sd(bill_length_mm, na.rm=T))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 5
##   species   body_mass_mean body_mass_sd bill_length_mean bill_length_sd
##   <chr>              <dbl>        <dbl>            <dbl>          <dbl>
## 1 Adelie             3701.         459.             38.8           2.66
## 2 Chinstrap          3733.         384.             48.8           3.34
## 3 Gentoo             5076.         504.             47.5           3.08

#Base R
penguins_noNA <- na.omit(penguins) #remove NA
by(penguins_noNA$body_mass_g, penguins_noNA$species, mean)

## penguins_noNA$species: Adelie
## [1] 3706.164
## ------------------------------------------------------------ 
## penguins_noNA$species: Chinstrap
## [1] 3733.088
## ------------------------------------------------------------ 
## penguins_noNA$species: Gentoo
## [1] 5092.437

by(penguins_noNA$body_mass_g, penguins_noNA$species, sd)

## penguins_noNA$species: Adelie
## [1] 458.6201
## ------------------------------------------------------------ 
## penguins_noNA$species: Chinstrap
## [1] 384.3351
## ------------------------------------------------------------ 
## penguins_noNA$species: Gentoo
## [1] 501.4762

by(penguins_noNA$bill_length_mm, penguins_noNA$species, mean)

## penguins_noNA$species: Adelie
## [1] 38.82397
## ------------------------------------------------------------ 
## penguins_noNA$species: Chinstrap
## [1] 48.83382
## ------------------------------------------------------------ 
## penguins_noNA$species: Gentoo
## [1] 47.56807

by(penguins_noNA$bill_length_mm, penguins_noNA$species, sd)

## penguins_noNA$species: Adelie
## [1] 2.662597
## ------------------------------------------------------------ 
## penguins_noNA$species: Chinstrap
## [1] 3.339256
## ------------------------------------------------------------ 
## penguins_noNA$species: Gentoo
## [1] 3.106116

Frequencies

`tabyl` from the `janitor` package (https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)

tabyl can get frequencies for single variables. The table function in Base R will so create a frequency table (but it isn’t as nice).

library(janitor)

tabyl(penguins$species)

##  penguins$species   n   percent
##            Adelie 152 0.4418605
##         Chinstrap  68 0.1976744
##            Gentoo 124 0.3604651

table(penguins$species)

## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

# to get proportions you need to wrap table in prob.table
prop.table(table(penguins$species))

## 
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651

tabyl can also get crosstabs or frequencies by groups:

#Tidyverse
penguins %>%
  tabyl(species, island)

##    species Biscoe Dream Torgersen
##     Adelie     44    56        52
##  Chinstrap      0    68         0
##     Gentoo    124     0         0

# Base R
table(penguins$species, penguins$island)

##            
##             Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0

You can easily add row and column totals like so:

#Tidyverse
penguins %>%
  tabyl(species, island) %>%
  adorn_totals(c("row", "col"))

##    species Biscoe Dream Torgersen Total
##     Adelie     44    56        52   152
##  Chinstrap      0    68         0    68
##     Gentoo    124     0         0   124
##      Total    168   124        52   344

#Base R
table <- tabyl(penguins$species, penguins$island) #notice the error

## Error in tabyl.default(penguins$species, penguins$island): The value supplied to the "show_na" argument must be TRUE or FALSE.
## 
## Did you try to call tabyl on two vectors, like tabyl(data$var1, data$var2) ? To create a two-way tabyl, the two vectors must be in the same data.frame, and the function should be called like this: 
## 
##          tabyl(data, var1, var2)
##          or
##          data %>% tabyl(var1, var2).  
## 
## See ?tabyl for more.

table <- tabyl(penguins, species, island)
adorn_totals(table, c("row", "col"))

##    species Biscoe Dream Torgersen Total
##     Adelie     44    56        52   152
##  Chinstrap      0    68         0    68
##     Gentoo    124     0         0   124
##      Total    168   124        52   344

Use adorn_ commands to use percentages, rounding, etc:

#Tidyverse
penguins %>%
  tabyl(species, island) %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages() %>%
  adorn_rounding()

##    species Biscoe Dream Torgersen Total
##     Adelie    0.3   0.4       0.3     1
##  Chinstrap    0.0   1.0       0.0     1
##     Gentoo    1.0   0.0       0.0     1
##      Total    0.5   0.4       0.2     1

#Base R
table <- tabyl(penguins, species, island)
table <- adorn_totals(table, c("row", "col"))
table <- adorn_percentages(table)
adorn_rounding(table)

##    species Biscoe Dream Torgersen Total
##     Adelie    0.3   0.4       0.3     1
##  Chinstrap    0.0   1.0       0.0     1
##     Gentoo    1.0   0.0       0.0     1
##      Total    0.5   0.4       0.2     1

Histograms

You can easily see frequencies by making a histogram with the hist command:

hist(penguins$bill_depth_mm)

Read more, about descriptives in R Reproducible statistics for psychologists with R: Descriptives.

Data Manipulation

The following functions will help you change and manipulate your data.

Rename Variables

Use the rename("new name" = old_name) function:

names(penguins) #print variable names

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

#Tidyverse
penguins <- penguins %>%
  rename("weight" = body_mass_g)

names(penguins) #check that rename worked

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "weight"           
## [7] "sex"               "year"

#Base R
names(penguins)[names(penguins)=="body_mass_g"] <- "weight"    #by variable name
names(penguins)[6] <- "weight"       #by column number

names(penguins) #check that rename worked

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "weight"           
## [7] "sex"               "year"

Recode Variables

You can use recode inside a mutate (change) function to change how individual factor levels are assigned. For example, for penguin sex, there is female, male, and NA. Let’s change female to 0 and male to 1 in the variable sex 2.

#Tidyverse
penguins %>%
  count(sex)

##      sex   n
## 1 female 165
## 2   male 168
## 3   <NA>  11

penguins %>%
  mutate(male = recode(sex, 
                       male = 1, 
                       female = 0)) %>%
  select(sex, male) %>%   # select the two sex variables to compare
  head() #head lists the first 6 parts

##      sex male
## 1   male    1
## 2 female    0
## 3 female    0
## 4   <NA>   NA
## 5 female    0
## 6   male    1

#Base R
table(penguins$sex)

## 
## female   male 
##    165    168

penguins$male <- ifelse(penguins$sex=="male", 1, ifelse(penguins$sex=="female",0, NA))
table(penguins$male)

## 
##   0   1 
## 165 168

Create New Variables

We can use mutate to change and create variables. For example, in the penguins data set, weight is in grams. Let’s convert that to pounds. 1 gram = 0.00220462 pounds.

#Tidyverse
penguins <- penguins %>%
  mutate(weight_lbs = weight * 0.00220462)

#Base R
penguins$weight_lbs <- penguins$weight*0.00220462

For more practice, check out some common data manipulation tools in these interactive exercises.

Statistics

Correlations

Use cor from base R for most correlations (e.g., Pearson’s r, Kendall’s $\tau$, Spearman’s $\rho$). Arguments include “method” and “use”. Method has options for Pearson, Kendall, and Spearman correlations. If there is missing data, like below, you can use pairwise complete observations or only complete observations. Type ?cor in the console for more information.

Correlation Matrix

penguins %>%
  select(where(is.numeric)) %>% #selecting only numeric variables
  cor(method = "pearson", 
      use = "complete.obs") %>%
  round(3)

##                   bill_length_mm bill_depth_mm flipper_length_mm weight   year
## bill_length_mm             1.000        -0.229             0.653  0.589  0.033
## bill_depth_mm             -0.229         1.000            -0.578 -0.472 -0.048
## flipper_length_mm          0.653        -0.578             1.000  0.873  0.151
## weight                     0.589        -0.472             0.873  1.000  0.022
## year                       0.033        -0.048             0.151  0.022  1.000
## male                       0.344         0.373             0.255  0.425  0.000
## weight_lbs                 0.589        -0.472             0.873  1.000  0.022
##                    male weight_lbs
## bill_length_mm    0.344      0.589
## bill_depth_mm     0.373     -0.472
## flipper_length_mm 0.255      0.873
## weight            0.425      1.000
## year              0.000      0.022
## male              1.000      0.425
## weight_lbs        0.425      1.000

#Base R
cor(penguins, use="complete.obs") #notice the error 'x' must be numeric. This indicates that we have non-numeric variables (i.e., strings, factors)

## Error in cor(penguins, use = "complete.obs"): 'x' must be numeric

str(penguins) #anything that is not "num" or "int" can not be used in a correlation

## 'data.frame':    344 obs. of  10 variables:
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ weight           : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : chr  "male" "female" "female" NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ male             : num  1 0 0 NA 0 1 0 1 NA NA ...
##  $ weight_lbs       : num  8.27 8.38 7.17 NA 7.61 ...

#Filter using only numeric
round(cor(Filter(is.numeric, penguins), use="complete.obs"),3)

##                   bill_length_mm bill_depth_mm flipper_length_mm weight   year
## bill_length_mm             1.000        -0.229             0.653  0.589  0.033
## bill_depth_mm             -0.229         1.000            -0.578 -0.472 -0.048
## flipper_length_mm          0.653        -0.578             1.000  0.873  0.151
## weight                     0.589        -0.472             0.873  1.000  0.022
## year                       0.033        -0.048             0.151  0.022  1.000
## male                       0.344         0.373             0.255  0.425  0.000
## weight_lbs                 0.589        -0.472             0.873  1.000  0.022
##                    male weight_lbs
## bill_length_mm    0.344      0.589
## bill_depth_mm     0.373     -0.472
## flipper_length_mm 0.255      0.873
## weight            0.425      1.000
## year              0.000      0.022
## male              1.000      0.425
## weight_lbs        0.425      1.000

Correlation Coefficient

If you just need the correlation coefficient between two variables, you can correlate only those variables using cor(data$var1, data$var2, method="method", use="use"). Note the use of round() to round results to 3 decimal places.

#Tidyverse
cor(penguins$bill_length_mm, penguins$flipper_length_mm,
    method = "pearson", 
    use = "complete.obs") %>%
  round(3)

## [1] 0.656

#Base R
round(cor(penguins$bill_length_mm, penguins$flipper_length_mm, use = "complete.obs"),3)

## [1] 0.656

Correlation Test

You can use cor.test to get the correlation coefficient, a p-value, and a confidence interval for the correlation:

cor.test(penguins$bill_length_mm, penguins$flipper_length_mm,
    method = "pearson", 
    use = "complete.obs")

## 
##  Pearson's product-moment correlation
## 
## data:  penguins$bill_length_mm and penguins$flipper_length_mm
## t = 16.034, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5912769 0.7126403
## sample estimates:
##       cor 
## 0.6561813

Write-up of a Pearson r or point biserial correlation:

The results of a Pearson r correlation with variable 1 and variable 2 showed that these two variables were significantly positively related, $r=.\#\#,p = .\#\#, r^2 = .\#\#$.
The results of a Pearson r correlation between penguin bill length and flipper length showed that these two variables were significantly positively related, $r = .66$, $p < .001$, $r^2 = .42.$

Read more about correlation in R:

Correlation Test Between Two Variables in R

Reproducible statistics for psychologists with R: Correlation

Chi-Square Test of Independence

Recall that crosstabs can easily be displayed with tabyl as well as in Base R using table:

#Tidyverse and tabyl
penguins %>%
  tabyl(species, island)

##    species Biscoe Dream Torgersen
##     Adelie     44    56        52
##  Chinstrap      0    68         0
##     Gentoo    124     0         0

#Base R
table(penguins$species, penguins$island)

##            
##             Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0

To calculate a chi-square statistics, we can call janitor’s chisq.test function. Note that because base R also has a chisq.test function, we will use janitor::chisq.test(). Both packages function the same way, only the data is entered differently. (Base R’s chisq.test cannot be used with %>% pipes and so the data must be entered using $).

For either function, you must give it a crosstabs table and use chisq.test(crosstabs table).

# Tidyverse, tably, janitor
penguins %>%
  tabyl(species, island) %>%
  janitor::chisq.test()

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 299.55, df = 4, p-value < 2.2e-16

#Base R
stats::chisq.test(penguins$species, penguins$island)

## 
##  Pearson's Chi-squared test
## 
## data:  penguins$species and penguins$island
## X-squared = 299.55, df = 4, p-value < 2.2e-16

Write-up of a Chi-square analysis:

The results of a chi-square test of independence with variable 1 and variable 2 showed that these two variables were significantly related, $\chi^2(df) = \#.\#\#, p = .\#\#\#$.
The results of a chi-square test of independence with penguin species and island location showed that these two variables were significantly related, $\chi^2(4) = 299.55, p < .001$.

Read more about chi square in R:

Chi-Square Test of Independence in R

Reproducible statistics for psychologists with R: Chi Square

Independent Samples t-test

We can use the t.test function from base R to conduct an independent t-test with one grouping variable (IV) that has 2 levels and one dependent variable (DV) that is continuous.

The t.test function has several arguments. For an independent-samples t-test, you can use: t.test(DV ~ IV, data=data, var.equal=FALSE)

Levene’s test of homoegeneity of variances

To assess whether variances are similar across groups, we can use the car package and the leveneTest(DV ~ IV, data=data) for this:

library(car)

leveneTest(weight_lbs ~ sex, # DV ~ IV
           data=penguins)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  6.0586 0.01435 *
##       331                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result showed a significant difference in variances (i.e., they are not equal).

T-test

t.test(weight_lbs ~ sex, # DV ~ IV
       data=penguins,
       var.equal = FALSE) # Note we choose var.equal = FALSE because of the Levene's test above. FALSE is default so if we did not include this, it would default to a Welch's t-test

## 
##  Welch Two Sample t-test
## 
## data:  weight_lbs by sex
## t = -8.5545, df = 323.9, p-value = 4.794e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.853156 -1.160171
## sample estimates:
## mean in group female   mean in group male 
##             8.514844            10.021507

Optional: We can get a neater table of data by using tidy from the broom package:

#Optional 
library(broom)

t.test(weight_lbs ~ sex, # DV ~ IV
       data=penguins,
       paired=F,
       var.equal = F) %>%
  tidy()

## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1    -1.51      8.51      10.0     -8.55 4.79e-16      324.    -1.85     -1.16
## # ... with 2 more variables: method <chr>, alternative <chr>

Estimate is the mean difference
Estimate 1 is female
Estimate 2 is male
Statistic is the t-value

Cohen’s d

# using psych package
psych::cohen.d(penguins[, c("weight_lbs", "sex")], "sex")

## Call: psych::cohen.d(x = penguins[, c("weight_lbs", "sex")], group = "sex")
## Cohen d statistic of difference between two means
##            lower effect upper
## weight_lbs   0.7   0.94  1.18
## 
## Multivariate (Mahalanobis) distance between groups
## [1] 0.94
## r equivalent of difference between two means
## weight_lbs 
##       0.42

Write-up of an Independent t-test:

The results of the independent t-test with XX as the independent variable and XX as the dependent variable showed that Group 1 (M=#.##, SD=#.##) reporting the dependent variable significantly more than Group 2 (M=#.##, SD=#.##), $t(df) = \#.\#\#, p = .\#\#\#, d = \#\#$.
The results of the independent t-test with sex as the independent variable and penguin weight as the dependent variable showed that Male penguins (M=10.0, SD=1.74) weigh significantly more than Females (M=8.5, SD=1.47), $t(323.9) = -8.55, p < .001, d = .95$.

Read more about independent t-tests in R:

Unpaired Two-Samples T-test in R

Reproducible statistics for psychologists with R: T-Tests

For more information on performing t-tests in R see: https://bookdown.org/ndphillips/YaRrr/t-test-t-test.html

Dependent Samples t-test

You can use the t-test command above, adding paired=T to the t.test command. The penguins data set does not have any paired data, so we will simulate some by adding random variation to each value:

penguins <- penguins %>%
  mutate(weight_lbs2 = weight_lbs + sample(-10:10, n(), replace = TRUE))

(paired <- t.test(penguins$weight_lbs, penguins$weight_lbs2, paired=T))

## 
##  Paired t-test
## 
## data:  penguins$weight_lbs and penguins$weight_lbs2
## t = -1.164, df = 341, p-value = 0.2453
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.006741  0.258203
## sample estimates:
## mean of the differences 
##               -0.374269

# paired-samples Cohen's d
difference <- penguins$weight_lbs - penguins$weight_lbs2
(cohens_d <- mean(difference, na.rm=T)/sd(difference, na.rm=T))

## [1] -0.06293927

Write-up of a Dependent t-test:

The results of the dependent t-test with XX as the independent variable and XX as the dependent variable showed that Pretest scores (M=#.##, SD=#.##) were significantly lower than Posttest scores (M=#.##, SD=#.##), $t(df) = \#.\#\#, p = .\#\#\#, d = \#\#$.
The results of the dependent t-test examining penguin weight at two time points showed no statistical difference, $t(341) = -0.544, p =.587, d = .06$.

Read more about dependent t-tests in R:

Paired Samples T-test in R

Reproducible statistics for psychologists with R: T-Tests

Oneway ANOVA

You can use aov from the stats package for a oneway between-subjects ANOVA, like this: aov(DV ~ group, data = data). We can then use summary.aov to get the results. Note: the aov function computes Type I Sums of Squares. If you want to repoduce the output you get from SPSS, you’ll need to use the car package and the Anova function to compute Type III Sums of Squares. For balanced designs, these give the same results. See the following for more information on ANOVA Sums of Squares calculation: https://www.r-bloggers.com/2011/03/anova-%E2%80%93-type-iiiiii-ss-explained/.

For more information on performing ANOVA in R see: https://bookdown.org/ndphillips/YaRrr/anova.html

weight_analysis <- aov(weight_lbs ~ island, data = penguins)
summary.aov(weight_analysis)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## island        2  419.5  209.76     110 <2e-16 ***
## Residuals   339  646.4    1.91                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness

# Type III Sums of Squares using the `Anova` function from the `car` package
(weight_analysis_2 <- car::Anova(lm(weight_lbs ~ island, data=penguins), type="III"))

## Anova Table (Type III tests)
## 
## Response: weight_lbs
##              Sum Sq  Df F value    Pr(>F)    
## (Intercept) 18052.4   1 9467.55 < 2.2e-16 ***
## island        419.5   2  110.01 < 2.2e-16 ***
## Residuals     646.4 339                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Welch's ANOVA
oneway.test(weight_lbs ~ island, data = penguins)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  weight_lbs and island
## F = 106.97, num df = 2.00, denom df = 150.41, p-value < 2.2e-16

# Levene's test
car::leveneTest(weight_lbs ~ island, data = penguins)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   2  25.064 7.053e-11 ***
##       339                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can perform post-hoc tests using different functions. For example, a Tukey test:

TukeyHSD(weight_analysis)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight_lbs ~ island, data = penguins)
## 
## $island
##                         diff        lwr       upr     p adj
## Dream-Biscoe     -2.21148681 -2.5968254 -1.826148 0.0000000
## Torgersen-Biscoe -2.22588447 -2.7459404 -1.705829 0.0000000
## Torgersen-Dream  -0.01439766 -0.5551373  0.526342 0.9978364

A pairwise t-test:

pairwise.t.test(penguins$weight_lbs, penguins$island,
                 p.adjust.method = "BH")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  penguins$weight_lbs and penguins$island 
## 
##           Biscoe Dream
## Dream     <2e-16 -    
## Torgersen <2e-16 0.95 
## 
## P value adjustment method: BH

To calculate an effect size, such as eta squared or partial eta squared, you can use the sjstats package. (Note: in an ANOVA model with one IV, eta squared = partial eta squared.)

library(sjstats)

eta_sq(weight_analysis, partial = T)

##     term partial.etasq
## 1 island         0.394

# Using effectsize package
effectsize::omega_squared(weight_analysis)

## # Effect Size for ANOVA
## 
## Parameter | Omega2 |       95% CI
## ---------------------------------
## island    |   0.39 | [0.32, 1.00]
## 
## - One-sided CIs: upper bound fixed at (1).

Write-up of a Oneway BS ANOVA:

There is a significant difference among the different groups of the independent variable, $F(df,f-statistic) = \#.\#\#, p = \#.\#\#, partial \eta^2= \#.\#\#$ . The results of the XXX posthoc test showed that Group 1 (M=#.##, SD=#.##) had higher levels of dependent variable compared to Group 2 (M=#.##, SD=#.##) and Group 3 (M=#.##, SD=#.##), d = ##.
There is a significant difference in penguin weight among penguins on different islands, $Welch's \: F(2,150.4) = 106.97, \: p < .0001, \: partial \: \eta^2=.39$. The results of the Tukey posthoc test showed that Biscoe penguins weighted more than Dream ($\Delta$M = 2.21, d = 1.6) and Torgersen penguins ($\Delta$M = 2.23, d = 1.6).

Saving

Saving your data

If you want to save the data and objects that you have loaded, cleaned, and analyzed, you may do so in two ways:

You can select the data objects and save them using save and saving it as a .Rdata file:

save(penguins, t_test_results, weight_analysis, file="my_data.Rdata")

You may load this data back into your R environment using load:

load("my_data.Rdata")

You can also save all the entire environment (workspace) using save.image:

save.image(file = "all_data.Rdata")

You can load it the same way as above.

You can also save data frames as CSV files with write.csv:

write.csv(penguins, file="penguins_data.csv")

Saving output

You can save your code and output together by compiling your document. Navigate to File -> Compile Document and choose HTML, Word, or PDF. Note: if there are any errors in your code, it will not compile until you fix them. Reading the error messages can help.

An Introduction to Data Analysis with R

EDPY 678 - Applied Stats III

2022-01-20

R Set Up

Packages

Installing Packages

Loading Packages

Loading Data

Data Objects

The Pipe %>%

Descriptive Statistics

Variable Names

Variable Types

Calculating Means

summary from base R

Summarize all variables

Summarize a single variable

describe from the psych package

Describe all variables

Describe a single variables

Describe by a group

skim from the skimr package

summarize from the tidyverse package

Summarize by a group

Frequencies

tabyl from the janitor package (https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)

Histograms

Data Manipulation

Rename Variables

Recode Variables

Create New Variables

Statistics

Correlations

Correlation Matrix

Correlation Coefficient

Correlation Test

Write-up of a Pearson r or point biserial correlation:

Chi-Square Test of Independence

Write-up of a Chi-square analysis:

Independent Samples t-test

Levene’s test of homoegeneity of variances

T-test

Cohen’s d

Write-up of an Independent t-test:

Dependent Samples t-test

Write-up of a Dependent t-test:

Oneway ANOVA

Write-up of a Oneway BS ANOVA:

Saving

Saving your data

Saving output

The Pipe `%>%`

`summary` from base `R`

`describe` from the `psych` package

`skim` from the `skimr` package

`summarize` from the `tidyverse` package

`tabyl` from the `janitor` package (https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)