1 Visualizing functions with ggplot2

ggplot2 is an R package for producing statistical, or data, graphics. Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics (Wilkinson 2005) Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Statistics and Computing. Springer. , that allows you to compose graphs by combining independent components. More on the structure and use of ggplot2 here: https://ggplot2-book.org/introduction

Here, we will use ggplot2 to visualize and explore the normal distribution. Material and ideas for this chapter are taken from https://ggplot2tutor.com/tutorials/sampling_distributions. Loading the tidyverse library, which includes ggplot2 and many more packages that are needed for the following plots:

# R
library(tidyverse)

1.1 Creating curves

We use stat_function to visualize arbitrary functions. For example, we draw a simple polynomial on an empty plot.

# R
ggplot(data.frame(x = c(-20, 20)), aes(x)) +
  stat_function(fun = function(x) { x**3 },
                geom = "line")

We use this way to draw curves without data. We create a dataframe that defines the limits of the visualization. The first argument of stat_function is fun, which stands for function. We can make up functions from scratch (as in the example above) or use predefined functions like dnorm to visualize the normal distribution.

# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line")

1.2 Creating areas under the curves

With geom we tell ggplot how to visualize our function. The most common visualizations are an area, a line or points. We are interested in areas under the curve, as we have seen in theory. Let’s do that, by just changing the geom parameter to area.

# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area")

We can change the color of the area.

# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "darkseagreen")

1.3 Visualize functions with arguments

We are interested to draw curves with certain function arguments. In stat_function we can add the args argument.

# R
ggplot(data.frame(x = c(20, 180)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                args = list(
                  mean = 100,
                  sd = 20
                ))

1.4 Adding multiple functions

Since ggplot2 follows the grammar of graphics, we can easily stack functions on top of each other. We can display two normal distributions with different mean values. We add an alpha level to the upper distribution to make the lower distribution visible.

# R
ggplot(data.frame(x = c(0, 180)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "darkgrey",
                args = list(
                  mean = 100,
                  sd = 20
                )) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                alpha = .8,
                args = list(
                  mean = 70,
                  sd = 20
                ))

We can also draw two normal distrbutions with the same mean and different variance / standard deviation

# R
ggplot(data.frame(x = c(-30, 230)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "darkgrey",
                args = list(
                  mean = 100,
                  sd = 20
                )) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                alpha = .8,
                args = list(
                  mean = 100,
                  sd = 45
                ))

1.5 Shading a part of the area under the curve

Often, we are only interested in the area under a curve for certain limits on the x-axis. For this we need the xlim argument. We will shade the the right half of the normal distribution.

# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-3, 3)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(0, 3)) +
  xlim(-3, 3)

We can draw the limits according to the confidence intervals, for example the 95% CI from the right side of the curve, using the qnorm(.95) expression.

# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-3, 3)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(qnorm(.95), 3)) +
  xlim(-3, 3)

We can do the same with the 95% CI for both sides of the curve, using qnorm(.025) and qnorm(.975) with xlim. Note that this is the standard normal distribution, with mean=0 and SD=1 and for this reason the qnorm(.975)=1,96 expression marks the x-value where the CI begins, since it is multiplied by 1(SD) and added to the mean(0).

# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-3, 3)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(qnorm(.975), 3)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(-3, qnorm(.025))) +
  xlim(-3, 3)

2 Simulated Data

2.1 Creating a histogram and compare it with the normal curve

Generate data from 1000 observations from a normal distribution with mean 10 and standard deviation 1

# R
generated_data <- data.frame(normal_data = rnorm(1000, 10, 1))

Compare to histogram

# R
ggplot(generated_data, aes(x = normal_data)) +
      geom_histogram(colour = "black", aes(y = after_stat(density)))  +
      geom_function(fun = dnorm, colour = "red", linewidth = 2,
                 args = list(mean = mean(generated_data$normal_data), 
                                sd = sd(generated_data$normal_data)))

Do the same with 100,000 onservations. See that the distribution fits better the normal distribution curve

# R
generated_data2 <- data.frame(normal_data = rnorm(100000, 10, 1))

ggplot(generated_data2, aes(x = normal_data)) +
      geom_histogram(colour = "black", aes(y = after_stat(density)))  +
      geom_function(fun = dnorm, colour = "red", linewidth = 2,
                 args = list(mean = mean(generated_data2$normal_data), 
                                sd = sd(generated_data2$normal_data)))

2.2 Q-Q plots

To draw a Q-Q plot for a normal distribution, we use the geoms geom_qq and geom_qq_line. First for 1000 observations.

# R
ggplot(generated_data, aes(sample = normal_data)) +
        geom_qq() + geom_qq_line()

Then we do the same for 100.000 observations. Which case shows a better fit with normality?

# R
ggplot(generated_data2, aes(sample = normal_data)) +
        geom_qq() + geom_qq_line()

3 Using an R dataset

We will use the iris dataset. Numerous guides have been written on the exploration of this widely known dataset. Iris, introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems, contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. These quantify the morphologic variation of the iris flower in its three species, all measurements given in centimeters.

# R
head(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

3.1 Compare the histogram with the normal curve

Compare to the normal curve with the histogram of the actual data for Sepal.Width. There are some deviations from the normal distribution, what do you think?

# R
ggplot(iris, aes(x = Sepal.Width)) +
      geom_histogram(colour = "black", bins = 22, aes(y = after_stat(density)))  +
      geom_function(fun = dnorm, colour = "red", linewidth = 2,
                 args = list(mean = mean(iris$Sepal.Width), 
                                sd = sd(iris$Sepal.Width)))

3.2 Q-Q plot

We draw a Q-Q plot for a comparison with the normal distribution.

# R
ggplot(iris, aes(sample = Sepal.Width)) +
        geom_qq() + geom_qq_line()

3.3 Different curves for different species

Now we will explore the distribution of the same trait for each species separately, to see if our data are differentiated among species. The histogram is for all data and the curves are the normal distribution curves for each species. Does this figure explain the shape of the histogram?

# R
ggplot(iris, aes(x = Sepal.Width)) +
      geom_histogram(colour = "black", bins = 22, aes(y = after_stat(density)))  +
      geom_function(fun = dnorm, colour = "red", linewidth = 2,
                 args = list(mean = mean(iris$Sepal.Width[1:50]), 
                                sd = sd(iris$Sepal.Width[1:50]))) +
 geom_function(fun = dnorm, colour = "green", linewidth = 2,
                 args = list(mean = mean(iris$Sepal.Width[51:100]), 
                                sd = sd(iris$Sepal.Width[51:100]))) +
 geom_function(fun = dnorm, colour = "blue", linewidth = 2,
                 args = list(mean = mean(iris$Sepal.Width[101:150]), 
                                sd = sd(iris$Sepal.Width[101:150])))

The above figure needs a legend to describe which species is represented by which curve. For this reason we need to replace geom_function with stat_function and use aes(colour = …) inside stat_function to map the colors. Then we adjust scale_colour_manual() at the end of the command to to use named colors corresponding to the species.

# R
ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(colour = "black", bins = 22, aes(y = after_stat(density))) +
  stat_function(fun = dnorm, aes(colour = "setosa"), linewidth = 2,
                args = list(mean = mean(iris$Sepal.Width[1:50]), 
                            sd = sd(iris$Sepal.Width[1:50]))) +
  stat_function(fun = dnorm, aes(colour = "versicolor"), linewidth = 2,
                args = list(mean = mean(iris$Sepal.Width[51:100]), 
                            sd = sd(iris$Sepal.Width[51:100]))) +
  stat_function(fun = dnorm, aes(colour = "virginica"), linewidth = 2,
                args = list(mean = mean(iris$Sepal.Width[101:150]), 
                            sd = sd(iris$Sepal.Width[101:150]))) +
  scale_colour_manual("Species", values = c(setosa = "red", versicolor = "green", virginica = "blue"))

3.4 View each species separately

In ggplot2 we can view separately the attributes of the distribution of different species in histograms.

# R
ggplot(iris, aes(x= Sepal.Width, fill = Species)) +
  geom_histogram(bins=22) + 
  facet_wrap(~Species)

We can do the same and draw the normal curve for each species.

# R
ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)), bins = 22) + 
  facet_wrap(~Species) + 
  geom_line(aes(y = dnorm(Sepal.Width,
                          mean = tapply(Sepal.Width, Species, mean, na.rm = TRUE)[PANEL],
                          sd = tapply(Sepal.Width, Species, sd, na.rm = TRUE)[PANEL])))