This handout demonstrates how to conduct the statistical analyses covered in ESM 250.

R Set Up

It is recommended that you use an R Project for each course or major assignment. This avoids having to set a working directory or write long file paths. You can move your project folder anywhere: as long as the R Project .Rproj file is included, there should be no issues.

To create an R Project, simply go to File > New Project…. If you want to create a new directory, select New Directory. On the next screen, choose New Project, locate where you want to put your project, and click Create Project. If you have a directory already and you just want to make it into a project, click Existing Directory.

To access a folder within your R Project folder, you can use a simple relative path such as read.csv("data/data.csv"). To access a file outside of your R Project folder, use .. to move up one directory: read.csv("../other_data/data.csv").

Packages

Installing Packages

Packages are installed by typing install.packages("package name") in the R console. To replicate the examples in this document, you will need to install the following packages (Note: you only need to install packages one time):

install.packages("janitor")
install.packages("effectsize")
install.packages("ggplot2")

Loading Packages

Loading a package can be done with the following code: library(package name). It is recommended to load all packages at the top of your script file.
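Following that recommendation, the packages installed above would be loaded at the top of your script like so:

```r
# load all required packages at the top of the script
library(janitor)
library(effectsize)
library(ggplot2)
```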

Loading Data

You can load many different data types using R. In this class, we will mainly use data stored as a CSV file:

  • CSV files with read.csv from the base R package:

    • read.csv("name_of_file.csv")
    • read.csv("http://www.website.com/internet/file.csv")

Throughout this tutorial, we will use data from the High School and Beyond survey from the National Center for Education Statistics. A codebook for the data is provided in the following table.

Variable Name | Description | Values
id | identification number | Range: 1-200
female | gender | male, female
race | race/ethnicity | white, african-amer, hispanic, asian
ses | socioeconomic status | low, middle, high
schtyp | school type | public, private
prog | type of school program | general, vocation, academic
read | standardized measure of reading ability | Range: 28-76
write | standardized measure of writing ability | Range: 31-67
math | standardized measure of mathematics literacy | Range: 33-75
science | standardized measure of science literacy | Range: 26-74
socst | standardized measure of social studies literacy | Range: 26-71

The High School and Beyond data may be loaded in R using the following code:

dat <- read.csv("https://lrocconi.github.io//files//hsb2.csv")

Loading data from Excel and other statistical software packages

  • Excel files with read_excel from the readxl package (you may need to modify function defaults for read_excel to work properly):

    • read_excel("name_of_file.xlsx")

    • Personally, I like to convert Excel files to a CSV file and load the CSV file into R.

  • The haven package reads data from other statistical software packages:

    • read_spss("name_of_file.sav") or read_sav reads SPSS data files
    • read_dta reads Stata data files
    • read_sas reads SAS data files
    • haven can also save a data set as an SPSS, Stata, or SAS data file using the write_sav, write_dta, and write_sas functions

Data Objects

R is an object-oriented programming language. That means that results are stored as objects that you can retrieve and manipulate later. As such, you save data and results as R objects. You can assign with either = or <-; however, <- is recommended.

Note: c() means “combine” and # allows you to make comments in your code.

# character vector
phrase <- "Good luck in Stats" 

# numeric vector
answers <- c(6*5, 6^2, 5*sqrt(23))

# a data frame
dataframe <- data.frame(gender = c("male", "female", "female", "male"),
                        age = c(18, 19, 19, 20))

# assign data to an object named dat
dat <- read.csv("https://lrocconi.github.io//files//hsb2.csv")

# assign results from a t-test to t_test_results
t_test_results <- t.test(read ~ female, data=dat)

Basic R Operations

Subsetting

The $ operator is used to subset or extract a named element from a list. Most commonly, it is used to select a specific variable (column) from a dataframe.
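For example, using a small data frame like the one created in the Data Objects section above (hypothetical values):

```r
df <- data.frame(gender = c("male", "female", "female", "male"),
                 age = c(18, 19, 19, 20))

df$age        # extract the age column as a numeric vector: 18 19 19 20
mean(df$age)  # compute a statistic on that single variable: 19
```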

The Pipe |>

In some of the code below you will see the pipe operator: |>. This operator allows you to connect multiple commands together without the use of multiple nested parentheses. It can be read as “and then”. You can easily connect multiple commands in a single pipe chain.
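For example, the two calls below are equivalent; the piped version reads left to right (the numbers are made up for illustration):

```r
x <- c(2.15, 3.47, 4.92)

# nested calls read inside-out
round(mean(x), 1)

# piped: take x, and then compute the mean, and then round to 1 decimal
x |>
  mean() |>
  round(1)
```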

Descriptive Statistics

Variable Names

Quickly view the column names (i.e., variables) in a dataframe with the names function:

names(dat)
##  [1] "id"      "female"  "race"    "ses"     "schtyp"  "prog"    "read"   
##  [8] "write"   "math"    "science" "socst"

Variable Types

View the types of variables using str:

str(dat)
## 'data.frame':    200 obs. of  11 variables:
##  $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
##  $ female : chr  "male" "female" "male" "male" ...
##  $ race   : chr  "white" "white" "white" "white" ...
##  $ ses    : chr  "low" "middle" "high" "high" ...
##  $ schtyp : chr  "public" "public" "public" "public" ...
##  $ prog   : chr  "general" "vocation" "general" "vocation" ...
##  $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
##  $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
##  $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
##  $ science: int  47 63 58 53 53 63 53 39 58 50 ...
##  $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

Variable types:

  • num - numeric
  • int - integer (1L, 2L; same as numeric but no decimals)
  • dbl - double (double precision; same as numeric)
  • factor - categorical
  • chr - character text
  • logical - TRUE/FALSE
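Character variables (such as female or ses above) are often converted to factors before analysis. A minimal sketch with made-up values:

```r
x <- c("public", "public", "private")
class(x)    # "character"

x <- factor(x)
class(x)    # "factor"
levels(x)   # levels are sorted alphabetically: "private" "public"
```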

Summary Information

summary from Base R

For numeric variables, this will return the minimum, maximum, mean, median, quartiles, and the number of NAs. Use the $ dollar sign to refer to a specific variable.

summary(dat$read)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   44.00   50.00   52.23   60.00   76.00

describe from the psych package

You can load the psych package and use describe to get the mean, standard deviation, median, and other statistics. Note: if you haven’t installed the psych package, run install.packages("psych") before running the code below.

Describe all variables

psych::describe(dat)
##         vars   n   mean    sd median trimmed   mad min max range  skew kurtosis
## id         1 200 100.50 57.88  100.5  100.50 74.13   1 200   199  0.00    -1.22
## female*    2 200   1.46  0.50    1.0    1.44  0.00   1   2     1  0.18    -1.98
## race*      3 200   3.47  0.98    4.0    3.71  0.00   1   4     3 -1.68     1.35
## ses*       4 200   2.18  0.86    2.0    2.23  1.48   1   3     2 -0.36    -1.55
## schtyp*    5 200   1.84  0.37    2.0    1.93  0.00   1   2     1 -1.84     1.40
## prog*      6 200   1.73  0.84    1.0    1.66  0.00   1   3     2  0.55    -1.37
## read       7 200  52.23 10.25   50.0   52.03 10.38  28  76    48  0.19    -0.66
## write      8 200  52.77  9.48   54.0   53.36 11.86  31  67    36 -0.47    -0.78
## math       9 200  52.65  9.37   52.0   52.23 10.38  33  75    42  0.28    -0.69
## science   10 200  51.85  9.90   53.0   52.02 11.86  26  74    48 -0.19    -0.60
## socst     11 200  52.41 10.74   52.0   52.99 13.34  26  71    45 -0.38    -0.57
##           se
## id      4.09
## female* 0.04
## race*   0.07
## ses*    0.06
## schtyp* 0.03
## prog*   0.06
## read    0.72
## write   0.67
## math    0.66
## science 0.70
## socst   0.76

Describe by a group

We can use describeBy(numeric_variable, group=grouping_variable) from the psych package if you want the descriptive statistics for individual groups:

psych::describeBy(dat$read, group=dat$prog)
## 
##  Descriptive statistics by group 
## group: academic
##    vars   n  mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 105 56.16 9.59     55      56 11.86  34  76    42  0.1    -0.87 0.94
## ------------------------------------------------------------ 
## group: general
##    vars  n  mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 45 49.76 9.23     50   49.68 10.38  28  68    40 0.09    -0.61 1.38
## ------------------------------------------------------------ 
## group: vocation
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 46.2 8.91     47   45.52 7.41  31  68    37 0.56    -0.16 1.26

Individual Functions

To compute specific descriptive statistics you can call the function directly using mean for mean, median for median, sd for standard deviation, IQR for interquartile range, and mad for median absolute deviation.

mean(dat$read)
## [1] 52.23
sd(dat$read)
## [1] 10.25294
IQR(dat$read)
## [1] 16
mad(dat$read)
## [1] 10.3782

Note: There is no built-in function to compute the mode in R. To get the mode (i.e., the most frequently occurring value), you need to save out a frequency table and sort it.

tab <- table(dat$read) #number of occurrences for each unique value
sort(tab, decreasing = TRUE) # sort highest to lowest
## 
## 47 50 63 52 57 42 44 55 68 60 65 39 34 73 36 37 41 43 45 71 76 28 31 35 46 48 
## 27 18 16 14 14 13 13 13 11  9  9  8  6  5  3  2  2  2  2  2  2  1  1  1  1  1 
## 53 54 61 66 
##  1  1  1  1

So 47 is the mode, or the most frequently occurring value for read.

Frequency Tables

table in Base R

In Base R, you can use the table function to print a frequency table.

table(dat$prog)
## 
## academic  general vocation 
##      105       45       50

However, notice that the table only gives you frequency counts. If you want the proportions in each category, you need to wrap the function in prop.table.

prop.table(table(dat$prog))
## 
## academic  general vocation 
##    0.525    0.225    0.250

tabyl from the janitor package

The tabyl function from the janitor package creates much nicer frequency tables and automatically includes a column for proportion.

tabyl(dat$prog)
##  dat$prog   n percent
##  academic 105   0.525
##   general  45   0.225
##  vocation  50   0.250

tabyl can also get crosstabs or frequencies by groups:

# janitor package
dat |>
  tabyl(prog, female)
##      prog female male
##  academic     58   47
##   general     24   21
##  vocation     27   23
# Base R
table(dat$prog, dat$female)
##           
##            female male
##   academic     58   47
##   general      24   21
##   vocation     27   23

Using the janitor package, you can easily add row and column totals like so:

dat |>
  tabyl(prog, female) |>
  adorn_totals(c("row", "col"))
##      prog female male Total
##  academic     58   47   105
##   general     24   21    45
##  vocation     27   23    50
##     Total    109   91   200

Or use adorn_ commands to include percentages, rounding, etc:

dat |>
  tabyl(prog, female) |>
  adorn_percentages() |>
  adorn_rounding(digits = 3)
##      prog female  male
##  academic  0.552 0.448
##   general  0.533 0.467
##  vocation  0.540 0.460

Graphing Data

There are two main ways to create graphs in R: Base R or ggplot2. I will use Base R to create a quick graph to visualize my data and use ggplot2 when I want more customization options for my graph or when I want to create publication-quality graphs. Below I will show you both ways to create some common graphs.

Bar chart

In Base R, you can save your frequency table as an object and use the barplot function to graph the frequency distribution.

Let’s create a bar chart for the program type variable.

freq_prog <- table(dat$prog)
barplot(freq_prog)

You can add some customization to the chart using the following options.

freq_prog <- table(dat$prog)
barplot(freq_prog, col = "blue", main = "Students by Program Type", xlab = "Program Type", ylab = "Number of Students")

Let’s create the same graph but using the ggplot2 package. The main function from ggplot2 is named ggplot. The basic syntax for ggplot2 is as follows: ggplot(data = data_frame) + geom_bar(aes(x = variable_name, fill = variable_name))

ggplot(data = dat) + 
  geom_bar(aes(x=prog, fill = prog)) +
  labs(title = "Students by Program Type", x = "Program Type", y = "Number of Students")

Side-by-side bar graph

Here is how to create a side-by-side bar chart using Base R or ggplot.

# Base R
crosstab <- table(dat$female, dat$prog)
barplot(crosstab, beside = TRUE, col = c("pink", "lightblue"), legend = rownames(crosstab), main = "Students by Program Type and Gender")

# ggplot
ggplot(data=dat, aes(x = prog, fill = female)) +
  geom_bar(position = "dodge") +
  labs(title = "Students by Program Type and Gender", x = "Program Type", y = "Count") +
  theme_minimal()

Histograms

You can easily see frequencies by making a histogram with the hist command:

hist(dat$math)

Here is how to do the same thing using ggplot.

ggplot(data = dat) + 
  geom_histogram(aes(x = math), bins = nclass.Sturges(dat$math)) +
  labs(title = "Histogram of Math Scores", x = "Math Score", y = "Number of Students")

Boxplots

The boxplot function in Base R can be used to create quick boxplots. You can also add a title and coloring using the “main” and “col” arguments.

boxplot(dat$math, main = "Math Scores", col = "lightblue")

Here’s how to create the same boxplot in ggplot:

ggplot(dat, aes(y = math)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Math Scores", y = "Math Score") +
  theme_minimal()

Here is how to create side-by-side boxplots in Base R and ggplot.

# Base R
boxplot(dat$math ~ dat$female, col = c("pink","blue"))

# ggplot
ggplot(dat, aes(x = female, y = math, fill = female)) +
  geom_boxplot() +
  scale_fill_manual(values = c("pink", "blue")) +
  labs(x = "Female", y = "Math Score") +
  theme_minimal()

Frequency Polygon

To create a frequency polygon in Base R, first save out a histogram as an object. Then, plot the histogram with the option type = "l".

hist_data <- hist(dat$read)

plot(hist_data$mids, hist_data$counts, type = "l", main = "Frequency Polygon of Reading Scores", xlab = "Reading Score", ylab = "Frequency")
lines(hist_data$mids, hist_data$counts, col = "blue")

Here is the same frequency polygon in ggplot:

ggplot(dat, aes(x = read)) +
  geom_freqpoly(binwidth = 5) +
  labs(title = "Frequency Polygon of Reading Scores", x = "Reading Score", y = "Frequency") +
  theme_minimal()

Scatterplots

Use the plot function in Base R to create a scatterplot; then use the abline function to add a regression line to the plot.

plot(dat$read, dat$math)
abline(lm(math ~ read, data = dat), col = "blue")

In ggplot, create a scatterplot with a regression line using the following code:

ggplot(dat, aes(x = read, y = math)) + 
  geom_point() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Statistical Analyses

Correlations

Use the cor function for most correlations (e.g., Pearson’s \(r\), Kendall’s \(\tau\), Spearman’s \(\rho\)). Arguments include “method” and “use”. Method has options for Pearson, Kendall, and Spearman correlations. If your data contain any missing values, you need to add the “use” argument to tell R how to handle the missing data. Type ?cor in the console for more information.

cor(dat$read, dat$math, method = "pearson")
## [1] 0.6622801
cor(dat$read, dat$math, method = "spearman")
## [1] 0.663126
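To see why the “use” argument matters, here is a small made-up example with missing values:

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, 8, NA)

cor(x, y)                        # returns NA because of the missing values
cor(x, y, use = "complete.obs")  # drops pairs with any NA before computing r
```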

Correlation Matrix

To get a correlation matrix, you first need to subset your data to keep only the numeric variables, then run the cor function on this new object.

# save out numeric variables
dat_numeric <- dat[,c("read", "write", "science", "socst")]
cor(dat_numeric)
##              read     write   science     socst
## read    1.0000000 0.5967765 0.6301579 0.6214843
## write   0.5967765 1.0000000 0.5704416 0.6047932
## science 0.6301579 0.5704416 1.0000000 0.4651060
## socst   0.6214843 0.6047932 0.4651060 1.0000000

Correlation Test

You can use cor.test to get the correlation coefficient, a p-value, and a confidence interval for the correlation:

cor.test(dat$read, dat$write,
    method = "pearson", 
    use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  dat$read and dat$write
## t = 10.465, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4993831 0.6792753
## sample estimates:
##       cor 
## 0.5967765

Regression line

To get the regression line from R, use the lm function. Give it your dependent variable (i.e., the variable you want to predict) and your independent variable, separated by a tilde ~.

lm(math ~ read, data = dat)
## 
## Call:
## lm(formula = math ~ read, data = dat)
## 
## Coefficients:
## (Intercept)         read  
##     21.0382       0.6051

Thus, the regression equation would be \(\hat{Math} = 21.0382 + 0.6051*Read\).
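You can also save the fitted model and use predict to generate predicted values. This assumes dat has been loaded as above; the reading scores 40 and 60 are hypothetical values chosen for illustration:

```r
fit <- lm(math ~ read, data = dat)

# predicted math scores for two hypothetical reading scores
predict(fit, newdata = data.frame(read = c(40, 60)))

# equivalently, by hand: 21.0382 + 0.6051 * 40 is about 45.24
```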

One-sample t-test

Let’s test the null hypothesis that the mean math achievement score in the population is 50. We can use the t.test function for a one-sample t-test, an independent samples t-test, and a paired-samples t-test. For the one-sample t-test, we give the function our variable of interest and our hypothesized population parameter in the argument mu.

t.test(dat$math, mu = 50)
## 
##  One Sample t-test
## 
## data:  dat$math
## t = 3.9928, df = 199, p-value = 9.184e-05
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  51.33868 53.95132
## sample estimates:
## mean of x 
##    52.645

We can use the effectsize package to compute effect sizes for our statistical tests.

effectsize::cohens_d(dat$math, mu = 50)
## Cohen's d |       95% CI
## ------------------------
## 0.28      | [0.14, 0.42]
## 
## - Deviation from a difference of 50.

Independent Samples t-test

We can use the t.test function to conduct an independent t-test with one grouping variable (IV) that has 2 levels and one dependent variable (DV) that is continuous. For an independent-samples t-test, you can use: t.test(DV ~ IV, data=data)

Levene’s test of homogeneity of variances

To assess whether variances are similar across groups, we can use the leveneTest(DV ~ IV, data=data) function from the car package:

car::leveneTest(science ~ female, 
                data=dat)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  3.0015 0.08475 .
##       198                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results did not show a significant difference in variances (i.e., we meet the assumption).

Independent-samples t-test

t.test(science ~ female, 
       data=dat,
       var.equal = TRUE) 
## 
##  Two Sample t-test
## 
## data:  science by female
## t = -1.8124, df = 198, p-value = 0.07144
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -5.290207  0.223164
## sample estimates:
## mean in group female   mean in group male 
##             50.69725             53.23077
# Note that I had to specify `var.equal = TRUE` because FALSE is the default; without it, R runs a Welch's t-test

To compute effect sizes, we can use the effectsize package.

# standardized mean difference
effectsize::cohens_d(science ~ female, data = dat)
## Cohen's d |        95% CI
## -------------------------
## -0.26     | [-0.54, 0.02]
## 
## - Estimated using pooled SD.
# proportion of variance explained
effectsize::omega_squared(aov(science~female, data = dat))
## For one-way between subjects designs, partial omega squared is
##   equivalent to omega squared. Returning omega squared.
## # Effect Size for ANOVA
## 
## Parameter | Omega2 |       95% CI
## ---------------------------------
## female    |   0.01 | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

Paired Samples t-test

Since the High School and Beyond dataset does not contain any repeated-measures data, I will illustrate how to perform a paired-sample test with the following example and data.

A major oil company would like to improve its tarnished image following a large oil spill. Its marketing department develops a short television commercial and tests it on a sample of n = 7 subjects. People’s attitudes about the company are measured with a short questionnaire both before and after viewing the commercial, where higher values indicate better perceptions of the company (scores range from 0 to 20). The data are below. Do the data indicate that the commercial would improve the company's image? Use an alpha = .10.

Before After
15 15
11 13
10 18
11 12
14 16
10 10
11 19
# enter data into R
paired_data <- data.frame(before = c(15, 11, 10, 11, 14, 10, 11),
                          after = c(15, 13, 18, 12, 16, 10, 19)) 

# Let's print the dataframe to ensure we have entered the data correctly 
paired_data
##   before after
## 1     15    15
## 2     11    13
## 3     10    18
## 4     11    12
## 5     14    16
## 6     10    10
## 7     11    19
# t-test
t.test(paired_data$after, paired_data$before, paired = TRUE, conf.level = .90)
## 
##  Paired t-test
## 
## data:  paired_data$after and paired_data$before
## t = 2.2601, df = 6, p-value = 0.06454
## alternative hypothesis: true mean difference is not equal to 0
## 90 percent confidence interval:
##  0.4206854 5.5793146
## sample estimates:
## mean difference 
##               3
# effect size
effectsize::cohens_d(paired_data$after, paired_data$before, paired = TRUE, ci = .90)
## For paired samples, 'repeated_measures_d()' provides more options.
## Cohen's d |       90% CI
## ------------------------
## 0.85      | [0.09, 1.56]

Chi-square Goodness of Fit

Let’s use an example to illustrate the use of a chi-square goodness of fit test.

A researcher is interested in the ability of people to recall their dreams, so the researcher observed 80 participants as they slept overnight in a laboratory. As part of the study, the researcher woke participants during the REM stages of their sleep and asked them whether or not they were dreaming. The data showing dream recall are below:

Did Recall Dream Did Not Recall Dream Unsure
58 12 10

Based on previous studies, the researcher expects 80% of participants to recall their dream, 10% to not recall their dream, and 10% to be unsure. Do these data follow the same distribution as previous research? Test this hypothesis using an alpha = .05.

# Enter data into R 
dream <- c(58, 12, 10) 

# add names 
names(dream) <- c("Did recall", "Did not", "Unsure") 

# Perform chi-square test 
dream_results <- chisq.test(dream, p = c(.8, .1, .1)) 

# to view results 
dream_results
## 
##  Chi-squared test for given probabilities
## 
## data:  dream
## X-squared = 3.0625, df = 2, p-value = 0.2163
# to view residuals
dream_results$stdres
## Did recall    Did not     Unsure 
##  -1.677051   1.490712   0.745356

Chi-Square Test of Independence

Let’s examine whether males and females differ in the types of academic programs in which they are enrolled.

Recall that crosstabs can easily be displayed with tabyl as well as in Base R using table:

#Tidyverse and tabyl
dat |> 
  tabyl(prog, female)
##      prog female male
##  academic     58   47
##   general     24   21
##  vocation     27   23
#Base R
table(dat$prog, dat$female)
##           
##            female male
##   academic     58   47
##   general      24   21
##   vocation     27   23

To calculate a chi-square statistic, we can call the chisq.test function from either the janitor package or Base R. Note that because Base R also has a chisq.test function, we will write janitor::chisq.test() to be explicit. Both functions work the same way; only the data are entered differently.

The janitor version takes a crosstab table (e.g., one created with tabyl), while the Base R version can take either a crosstab table or two vectors.

# Using tably from janitor
dat |>
  tabyl(prog, female) |>
  janitor::chisq.test() 
## 
##  Pearson's Chi-squared test
## 
## data:  tabyl(dat, prog, female)
## X-squared = 0.052809, df = 2, p-value = 0.9739
#Base R
stats::chisq.test(dat$prog, dat$female)
## 
##  Pearson's Chi-squared test
## 
## data:  dat$prog and dat$female
## X-squared = 0.052809, df = 2, p-value = 0.9739

Optional: Oneway ANOVA

You can use aov from the stats package for a one-way between-subjects ANOVA, like this: aov(DV ~ group, data = data). We can then use summary.aov to get the results. Note: the aov function computes Type I Sums of Squares. If you want to reproduce the output you get from other statistical software packages such as SPSS and SAS, you’ll need to use the Anova function from the car package to compute Type III Sums of Squares. For balanced designs, these give the same results. See the following for more information on ANOVA Sums of Squares calculations: https://www.r-bloggers.com/2011/03/anova-%E2%80%93-type-iiiiii-ss-explained/.

ses_analysis <- aov(read ~ ses, data = dat)
summary.aov(ses_analysis)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## ses           2   1832   916.2   9.456 0.00012 ***
## Residuals   197  19087    96.9                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Type III Sums of Squares using the `Anova` function from the `car` package
(ses_analysis_2 <- car::Anova(lm(read ~ ses, data=dat), type="III"))
## Anova Table (Type III tests)
## 
## Response: read
##             Sum Sq  Df  F value    Pr(>F)    
## (Intercept) 185150   1 1910.962 < 2.2e-16 ***
## ses           1832   2    9.456 0.0001199 ***
## Residuals    19087 197                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Welch's ANOVA
oneway.test(read ~ ses, data = dat)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  read and ses
## F = 8.7397, num df = 2.00, denom df = 107.24, p-value = 0.0003046
# Levene's test
car::leveneTest(read ~ ses, data = dat)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  2.5538 0.08036 .
##       197                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can perform post-hoc tests using different functions. For example, a Tukey test:

TukeyHSD(ses_analysis)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = read ~ ses, data = dat)
## 
## $ses
##                  diff         lwr       upr     p adj
## low-high    -8.223404 -12.7855387 -3.661270 0.0000948
## middle-high -4.921053  -8.7945768 -1.047528 0.0085237
## middle-low   3.302352  -0.8430804  7.447784 0.1468044

A pairwise t-test with a Benjamini-Hochberg adjustment (i.e., false discovery rate):

pairwise.t.test(dat$read, dat$ses,
                 p.adjust.method = "BH")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dat$read and dat$ses 
## 
##        high    low   
## low    9.6e-05 -     
## middle 0.0046  0.0614
## 
## P value adjustment method: BH

To calculate an effect size, such as eta squared or omega squared, you can use the effectsize package:

effectsize::omega_squared(ses_analysis)
## # Effect Size for ANOVA
## 
## Parameter | Omega2 |       95% CI
## ---------------------------------
## ses       |   0.08 | [0.02, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

Saving

Saving your data

If you want to save the data and objects that you have loaded, cleaned, and analyzed, you may do so in two ways:

You can select specific data objects and save them as a .Rdata file using save:

save(dat, dat_numeric, paired_data, file="my_data.Rdata")

You may load this data back into your R environment using load:

load("my_data.Rdata")

You can also save the entire environment (workspace) using save.image:

save.image(file = "all_data.Rdata")

You can load it the same way as above.

You can also save data frames as CSV files with write.csv:

write.csv(dat_numeric, file="numeric_data.csv")

Saving output

You can save your code and output together by compiling your document. Navigate to File > Compile Document and choose HTML, Word, or PDF. Note: if there are any errors in your code, the document will not compile until you fix them. Reading the error messages can help.